Conference PaperPDF Available

A Comparative Review and Evaluation of Approximate Adders

May 2015

May 2015

DOI:10.1145/2742060.2743760

Conference: the 25th edition

Authors:

Honglan Jiang

Shanghai Jiao Tong University

Jie Han

University of Alberta

As an important arithmetic module, the adder plays a key role in determining the speed and power consumption of a digital signal processing (DSP) system. The demands of high speed and power efficiency as well as the fault tolerance nature of some applications have promoted the development of approximate adders. This paper reviews current approximate adder designs and provides a comparative evaluation in terms of both error and circuit characteristics. Simulation results show that the equal segmentation adder (ESA) is the most hardware-efficient design, but it has the lowest accuracy in terms of error rate (ER) and mean relative error distance (MRED). The error-tolerant adder type II (ETAII), the speculative carry select adder (SCSA) and the accuracy-configurable approximate adder (ACAA) are equally accurate (provided that the same parameters are used), however ETATII incurs the lowest power-delay-product (PDP) among them. The almost correct adder (ACA) is the most power consuming scheme with a moderate accuracy. The lower-part-OR adder (LOA) is the slowest, but it is highly efficient in power dissipation.

The error-tolerant adder type II (ETAII) [Zhu et al. 2009]: the carry propagates through the two shaded blocks.

…

The speculative carry selection adder (SCSA).

…

The basic arithmetic process of a 4 × 4 unsigned multiplication with possible truncations to a limited width. : an input, a partial product or an output product; : a truncated bit; : a full adder or a half adder.

…

Partial product accumulation of a 4 × 4 unsigned multiplier using a carry-save adder array. HA: a half adder; FA: a full adder.

…

The broken-array multiplier (BAM) with 4 vertical lines and 2 horizontal lines omitted [Mahdiani et al. 2010]. : a carry-save adder cell.

…

Figures - uploaded by Honglan Jiang

Content may be subject to copyright.

Content uploaded by Honglan Jiang

Content may be subject to copyright.

A Review, Classiﬁcation and Comparative Evaluation of Approximate

Arithmetic Circuits

HONGLAN JIANG, University of Alberta

CONG LIU, University of Alberta

LEIBO LIU, Tsinghua University

FABRIZIO LOMBARDI, Northeastern University

JIE HAN, University of Alberta

Often as the most important arithmetic modules in a processor, adders, multipliers and dividers determine

the performance and the energy efﬁciency of many computing tasks. The demand of higher speed and power

efﬁciency, as well as the feature of error resilience in many applications (e.g., multimedia, recognition and

data analytics), have driven the development of approximate arithmetic design. In this article, a review

and classiﬁcation are presented for the current designs of approximate arithmetic circuits including adders,

multipliers and dividers. A comprehensive and comparative evaluation of their error and circuit character-

istics is performed for understanding the features of various designs. By using approximate multipliers and

adders, the circuit for an image processing application consumes as little as 47% of the power and 36% of

the power-delay product of an accurate design while achieving a similar image processing quality. Improve-

ments in delay, power and area are obtained for the detection of differences in images by using approximate

dividers.

CCS Concepts: •General and reference →Surveys and overviews; Evaluation; Measurement;

•Hardware →Arithmetic and datapath circuits; Combinational circuits;

Additional Key Words and Phrases: Approximate computing, approximate circuit, adder, multiplier, divider,

hardware, accuracy, image processing.

1. INTRODUCTION

While computational errors are in general not desirable, applications such as multi-

media, wireless communication, recognition, and data mining are tolerant of the oc-

currence of some errors [Han and Orshansky 2013]. Due to the perceptual limitations

of humans, these errors do not make an obvious difference in applications such as im-

age, audio and video processing. Moreover, in many digital signal processing (DSP)

systems, inputs from the outside world are noisy, so there is a limit in precision or

accuracy in the computed results. Many applications are based on statistical or prob-

abilistic computation, such as classiﬁcation and recognition algorithms. Due to the

nature of these applications, trivial errors in computation do not result in a signiﬁcan-

t performance degradation. Therefore, approximate computing is applicable in many

applications that can tolerate the loss of certain accuracy [Venkatesan et al. 2010].

This work was partly supported by the University of Alberta and the Natural Sciences and Engineering

Research Council (NSERC) of Canada.

Author’s addresses: H. Jiang, C. Liu and J. Han, Department of Electrical and Computer Engineering, U-

niversity of Alberta, Edmonton, AB T6G 1H9, Canada; email: {honglan, cong4, jhan8}@ualberta.ca; L. Liu,

Institute of Microelectronics, Tsinghua University, Beijing 100084, China; email: liulb@tsinghua.edu.cn;

F. Lombardi, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

02115, USA; email: lombardi@ece.neu.edu.

ACM acknowledges that this contribution was co-authored by an afﬁliate of the national government of

Canada. As such, the Crown in Right of Canada retains an equal interest in the copyright. Reprints must

include clear attribution to ACM and the author’s government agency afﬁliation. Permission to make digital

or hard copies for personal or classroom use is granted. Copies must bear this notice and the full citation on

the ﬁrst page. Copyrights for components of this work owned by others than ACM must be honored. To copy

otherwise, distribute, republish, or post, requires prior speciﬁc permission and/or a fee. Request permissions

from permissions@acm.org.

DOI:

http://dx.doi.org/10.1145/3094124

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:2 H. Jiang et al.

Past research on approximate computing has spanned from circuits to programming

languages [Han 2016]. In [Datla et al. 2009], an approximate squaring circuit is pro-

posed. A new logic synthesis approach is introduced to reduce the area of a synthe-

sized circuit for a given threshold of error rate in [Shin 2010]. In [Chippa et al. 2010],

a scalable-effort design approach is proposed to implement highly efﬁcient hardware

for error-resilient applications. Automated design processes have been proposed for

approximate digital circuit design using Cartesian genetic programming [Vasicek and

Sekanina 2015; Mrazek et al. 2016]. Sampson et al. have developed the EnerJ lan-

guage, an extension to Java [Sampson et al. 2011]. This language supports approx-

imate data types for low-power computation. Various computing and memory archi-

tectures have been proposed for supporting approximate computing applications [Es-

maeilzadeh et al. 2012; Miguel et al. 2015]. In this article, we focus on approximate

circuit designs and particularly approximate arithmetic circuits of adders, multipliers

and dividers.

Design metrics and analytical approaches have been proposed for the evaluation of

approximate adders [Liang et al. 2013; Huang et al. 2012; Miao et al. 2012; Venkatesan

et al. 2010; Liu et al. 2015; Mazahir et al. 2017]. Monte Carlo simulation has been

employed to acquire data for analysis. In this article, the error rate (ER), the error

distance (ED) and the average error are used to evaluate the error characteristics

of the approximate designs. Hardware related ﬁgures of merit including critical path

delay, circuit area and power dissipation, as well as compound metrics including the

power-delay product (PDP) and area-delay product (ADP), are utilized to assess the

circuit characteristics of these designs.

Image processing has been essential in diverse applications including multimedia,

biomedical imaging and pattern recognition [Acharya and Ray 2005]. Taking advan-

tage of its inherent error resilience, image processing can be efﬁciently implemented

by using approximate arithmetic circuits. Therefore, image sharpening and change

detection are considered for further evaluation of the approximate circuits in addition

to the evaluation using design metrics. The simulation results show that the image

sharpening circuit using approximate adders and multipliers saves as much as 53% of

power and 58% of area compared to an accurate design with a similar accuracy. The

change detection circuit using approximate dividers achieves as much as 40% improve-

ment in speed and 25% improvement in power compared with an accurate design at a

similar accuracy. While some preliminary results have been presented in [Jiang et al.

2015] and [Jiang et al. 2016b], this article makes the following new contributions.

1. A larger set of approximate adders that include more recent designs are consid-

ered in this article. Moreover, a new error metric (average error) is used to measure

the output bias of an approximate design, which is very important in an accumulative

operation.

2. Approximate Booth multipliers for signed multiplication are included in the eval-

uation, while only unsigned approximate multipliers are considered in [Jiang et al.

2016b].

3. Current approximate dividers are reviewed and their features are discussed with

respect to performance, accuracy and hardware consumption.

4. The STM CMOS 28nm process is used in the circuit synthesis throughout the

work in this article, while an older 65nm technology was used in [Jiang et al. 2015].

5. The considered approximate arithmetic circuits of adders, multipliers and dividers

are applied to two image processing applications for further evaluation. The accuracy

and circuit characteristics are obtained by simulation and synthesis.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:3

i+1

......

n-1

out

Critical Path

Fig. 1. The n-bit ripple-carry adder (RCA). FA: a 1-bit full adder.

SPG

..... .

SPG

n-1

out

Carry Lookahe ad Generator

n-1

Critical Path

n-1

Fig. 2. The n-bit carry lookahead adder (CLA). SPG: the cell used to produce the sum,

generate (gi=aibi) and propagate (pi=ai+bi) signals.

2. APPROXIMATE ADDERS

An adder performs the addition of two binary numbers. Two basic adders are the

ripple-carry adder (RCA) (Figure 1) and the carry lookahead adder (CLA) (Figure 2).

In an n-bit RCA, the carry of each full adder (FA) is propagated to the next FA, thus

the delay and circuit complexity increase proportionally with n(or O(n)). An n-bit

CLA consists of nunits that operate in parallel to produce the sum and the generate

(gi=aibi) and propagate (pi=ai+bi) signals for generating the lookahead carries.

The delay of CLA is logarithmic in n(or O(log(n))), thus signiﬁcantly shorter than for

RCA. However, a CLA requires a larger circuit area (in O(nlog(n))), incurring a higher

power dissipation.

Many approximation schemes have been proposed by reducing the critical path and

hardware complexity of an accurate adder. An early methodology is based on a specu-

lative operation [Lu 2004; Verma et al. 2008]. In an n-bit speculative adder, each sum

bit is predicted by its previous kless signiﬁcant bits (LSBs) (k < n). As the carry chain

is shorter than n, a speculative adder is faster than a conventional design. A segment-

ed adder is implemented by several smaller adders operating in parallel [Mohapatra

et al. 2011; Zhu et al. 2009; Kahng and Kang 2012; Yang et al. 2016]. Hence, the carry

propagation chain is truncated into shorter segments. Segmentation is also utilized

in [Du et al. 2012; Kim et al. 2013; Ye et al. 2013; Lin et al. 2015; Li and Zhou 2014;

Hu and Qian 2015; Miao et al. 2012; Camus et al. 2015; 2016], but the carry input for

each sub-adder is selected differently. This type of adder is referred to as a carry select

adder. Another method for reducing the critical path delay and power dissipation is

by approximating a full adder [Mahdiani et al. 2010; Gupta et al. 2013; Yang et al.

2013; Cai et al. 2016; Angizi et al. 2017]. The approximate full adder is then used to

implement the LSBs in an accurate adder. Thus, approximate adders are divided into

four categories, as brieﬂy summarized below.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:4 H. Jiang et al.

... a

n-1

n-2

n-3

n-4

k k

...

n-1

n-2

...

n-5

...

Fig. 3. The almost correct adder (ACA). : the carry propagation path of the sum.

k-bit Adder k-bit Adder k-bit Adder l-bit Adder

...

l-1:0

n-k-1:n-2k

n-1:n-k

l-1:0

n-1:n-k

n-k-1:n-2k

k+l-1:l

Fig. 4. The equal segmentation adder (ESA). k: the maximum carry chain length; l: the

size of the ﬁrst sub-adder (l≤k).

2.1. Classiﬁcation of Approximate Adders

2.1.1. Speculative adders. The almost correct adder (ACA) [Verma et al. 2008] is based

on the speculative adder design of [Lu 2004]. In an n-bit ACA, kLSBs are used to

predict the carry for each sum bit (n>k), as shown in Figure 3. Therefore, the critical

path delay is reduced to O(log(k)) (for a parallel implementation such as CLA, the

same below). The design in [Lu 2004] requires (n−k)k-bit sub-carry generators in an

n-bit adder and thus, the hardware consumption is rather high (in O((n−k)klog(k))).

This overhead is reduced in [Verma et al. 2008] by sharing some components among

the sub-carry generators.

2.1.2. Segmented adders. The equal segmentation adder (ESA) divides an n-bit adder

into a number of smaller k-bit sub-adders operating in parallel with ﬁxed carry inputs,

so no carry is propagated among the sub-adders (Figure 4) [Mohapatra et al. 2011].

The delay of ESA is O(log(k)) and the circuit complexity is O(nlog(k)). Its hardware

overhead is signiﬁcantly lower than ACA.

The error-tolerant adder type II (ETAII) consists of parallel carry generators and

sum generators [Zhu et al. 2009], as shown in Figure 5. The carry signal from the pre-

vious carry generator propagates to the next sum generator. Therefore, ETAII utilizes

more information to predict the carry and thus it is more accurate than ESA for the

same k. The circuit of ETAII is more complex than that of ESA, and its delay is larger

due to the longer critical path (2k).

In an n-bit accuracy-conﬁgurable approximate adder (ACAA), n

k−12k-bit sub-

adders are required [Kahng and Kang 2012]. Each sub-adder adds 2kconsecutive bits

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:5

Carry

Generator

Carry

Generator

Sum

Generator

Sum

Generator

...

Sum

Generator

Carry

Generator

k-1:0

n-1:n-k

n-k-1:n-2k

k-1:0

n-k-1:n-2k

n-1:n-k

Fig. 5. The error-tolerant adder type II (ETAII) [Zhu et al. 2009]: the carry propagates

through the two shaded blocks.

Fig. 6. The speculative carry selection adder (SCSA).

with an overlap of kbits and all 2k-bit sub-adders operate in parallel to reduce the

delay to O(log(k)). In each sub-adder, half of the most signiﬁcant sum bits is selected

as the partial sum. The accuracy of ACAA can be conﬁgured at runtime. Moreover,

ACAA has the same carry propagation path as ETAII for each sum, so they are equally

accurate for the same k.

The dithering adder divides an adder into an accurate, more signiﬁcant sub-adder

and a less signiﬁcant sub-adder with upper and lower bounding modules [Miao et al.

2012]. The output of the less signiﬁcant sub-adder is conditionally selected. An effec-

tive “Dither Control” enables a smaller variance in the overall error.

To reduce the error distance, an error control and compensation method is proposed

for a segmented adder in [Yang et al. 2016]. This method employs a multistage laten-

cy to compensate the carry prediction error in a more signiﬁcant segmentation, thus

trading off computing efﬁciency for an improved accuracy.

The delays of the segmented adders are O(log(k)) and the circuit complexities are

O(nlog(k)) for ESA and ETAII, and O((n−k)log(k)) for ACAA.

2.1.3. Carry select adders. In a carry select adder, several signals are commonly used.

For the ith block, generate gi,j =ai,jbi,j , propagate pi,j =ai,j ⊕bi,j , and Pi=

k−1

j=0

pi,j ,

where ai,j and bi,j are the jth LSBs of the input operands. Pi= 1 indicates that all k

propagate signals in the ith block are true.

An n-bit speculative carry select adder (SCSA) consists of m=n

ksub-adders (or

window adders) [Du et al. 2012]. Each sub-adder is made of two k-bit adders: adder0

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:6 H. Jiang et al.

with carry-in “0” and adder1 with carry-in “1”. The carry-out of adder0 is connected

to a multiplexer to select the addition result as part of the ﬁnal result, as shown in

Figure 6. SCSA and ETAII achieve the same accuracy for the same value of kdue to

the same carry predict function, while SCSA uses an additional adder and multiplexer

in each block.

Similar to SCSA, an n-bit adder is divided into n

kblocks in the carry skip adder

(CSA) [Kim et al. 2013]. Each block in CSA consists of a sub-carry generator and a sub-

adder. The carry-in of the (i+ 1)th sub-adder is determined by the propagate signals of

the ith block: it is the carry-out of the (i−1)th sub-carry generator when all propagate

signals are true (Pi= 1), otherwise it is the carry-out of the ith sub-carry generator.

Therefore, the critical path delay of CSA is O(log(k)). This carry select scheme im-

proves the carry prediction accuracy.

Different from SCSA, the carry speculative adder (CSPA) in [Lin et al. 2015] contains

one sum generator, two internal carry generators (one with carry-0 and one with carry-

1) and one carry predictor in each block. The output of the ith carry predictor is used

to select carry signals for the (i+ 1)th sum generator. linput bits (rather than k, l < k)

in a block are used in a carry predictor. Therefore, the hardware overhead is reduced

compared to SCSA.

The consistent carry approximate adder (CCA) is similar to SCSA in that each block

of CCA consists of adders with carry-0 and carry-1 [Li and Zhou 2014]. The select

signal of a multiplexer is determined by the propagate signal, i.e., Si= (Pi+Pi−1)SC +

(Pi+Pi−1)Ci−1, where Ci−1is the carry-out of the (i−1)th adder0 and SC is a global

speculative carry. In CCA, the carry prediction depends not only on its LSBs, but also

on the higher bits; its critical path delay is similar to that of SCSA.

The generate signals-exploited carry speculation adder (GCSA) has a similar struc-

ture as CSA and uses the generate signals for carry speculation [Hu and Qian 2015].

The difference between them lies in the carry selection; the carry-in for the (i+ 1)th

sub-adder is selected by its own propagate signals rather than its previous block. The

carry-in is the most signiﬁcant generate signal gi,k−1of the ith block if Pi= 1, or else it

is the carry-out of the ith sub-carry generator. This carry selection scheme effectively

controls the maximum relative error.

In the gracefully-degrading accuracy-conﬁgurable adder (GDA) [Ye et al. 2013], the

control signals are used to conﬁgure the accuracy by selecting an accurate or approx-

imate carry-in signal using a multiplexer for each sub-adder. The delay of GDA is

determined by the carry propagation and thus by the control signals to the multiplex-

ers.

In the carry cut-back adder (CCBA) [Camus et al. 2016], the full carry propagation

is prevented by a controlled multiplexer or an OR gate for a high-speed operation. The

multiplexer is controlled by a carry propagate block at a higher-signiﬁcance position

to cut the carry propagation at a lower-signiﬁcance position. The delay and accuracy of

the CCBA largely depend on the distance between the propagate block and the cutting

multiplexer, thus allowing a high accuracy with a marginal overhead.

The critical path delays of the carry select adders are given by O(log(k)), where kis

the size of the sub-adder.

2.1.4. Approximate full adders. In this type of design, approximate full adders are im-

plemented in the LSBs of a multibit adder. It includes the simple use of OR gates (and

one AND gate for carry propagation) in the so-called lower-part-OR adder (LOA) (Fig-

ure 7) [Mahdiani et al. 2010], the approximate designs of the mirror adder (AMAs)

[Gupta et al. 2013] and the approximate XOR/XNOR-based full adders (AXAs) [Yang

et al. 2013]. Additionally, emerging technologies such as magnetic tunnel junctions

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:7

a0 b0

al-1 bl-1

...

al-1 bl-1

l-bit OR-based Sub-Adder

(n-l)-bit Accurate

Sub-Adder

al-1:0

bl-1:0

an-1:l

bn-1:l

sl-1

Cin

sl-1:0

sn-1:l

Cout

Fig. 7. The lower-part-OR adder (LOA).

have been considered for the design of approximate full adders for a shorter delay, a

smaller area and a lower power consumption [Cai et al. 2016; Angizi et al. 2017].

The critical path of this type of adders depends on its approximation scheme. For

LOA, it is approximately O(log(n−l)), where lis the number of bits in the lower part

of an adder. In the evaluation, LOA is selected as the reference design because the

other designs require customized layouts at the transistor level; hence, they are not

comparable with the other types of approximate adders that are approximated at the

logic gate level. Finally, an adder with the LSBs truncated is referred to as a truncated

adder that works with a lower precision. It is considered as a baseline design.

2.2. Evaluation of Approximate Adders

2.2.1. Error Characteristics. Monte Carlo simulation is performed to evaluate the accu-

racy of the approximate adders. The error distance (ED) and the relative error dis-

tance (RED) are calculated as: ED =|M0−M|and RE D =ED

M, where M0is the

approximate result and Mis the accurate result [Liang et al. 2013]. The mean error

distance (MED) is the mean of all possible EDs. The error rate (ER, the probability

of producing an incorrect result), the normalized MED (NM ED, the normalization

of MED by the maximum output of the accurate design) and the mean relative error

distance (MRED, the average value of all possible REDs) are used to assess the error

characteristics of the approximate designs. Moreover, the average error (the mean of

all possible errors (M0−M)) is used to evaluate the bias of an approximate arithmetic

design.

The functions of 16−bit approximate adders are simulated by MATLAB using 10

million uniformly distributed random input combinations. Table I shows the simula-

tion results. The size of the carry predictor for CSPA is dk/2ein this evaluation. The

global speculative carry SC for CCA is “0”, which is proved to be more accurate than

using “1”. Additionally, the adder with kLSBs truncated (TruA-k) is simulated for

comparison.

As shown in Table I, ETAII, ACAA and SCSA have the same error characteristics (in

ER,N MED and MRED) due to the same carry propagation chain for each sum bit.

The NM ED and M RED show the same trend, so only M RED and ER are considered

in the comparison, as shown in Figure 8. An equivalent carry propagation chain is

selected for the considered approximate adders i.e., the parameter kfor ACA, ESA,

LOA and TruA is 8, while it is 4 for CSA, GCSA, ETAII, ACAA, SCSA, CCA and CSPA.

These approximate adders are considered as equivalent approximate adders.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:8 H. Jiang et al.

Table I. Simulation Results of the Error Characteristics for the Approximate Adders

Adder Type ER (%) N MED (10−3)MRED (10−3) Average Error

Speculative Adders

ACA-4 16.66 7.80 18.90 -1024.6

ACA-5 7.76 3.90 9.60 -511.7

Segmented Adders

ESA-4 85.07 15.70 40.40 -2047.5

ESA-5 80.03 7.80 20.80 -1023.4

ETAII-4 5.85 0.97 2.60 -127.5

ETAII-5 2.28 0.24 0.65 -31.6

ACAA-4 5.85 0.97 2.60 -127.5

ACAA-5 2.29 0.24 0.65 -31.6

Carry Select Adders

SCSA-4 5.85 0.97 2.60 -127.5

SCSA-5 2.28 0.24 0.65 -31.6

CSA-4 0.18 0.06 0.15 -7.4

CSA-5 0.02 0.004 0.01 -0.5

CSPA-4 29.82 3.90 10.40 -511.4

CSPA-5 11.31 0.98 2.70 -128.3

CCA-4 8.71 0.98 2.00 -128.3

CCA-5 3.78 0.25 0.49 -32.2

GCSA-4 4.26 0.48 0.98 -63.2

GCSA-5 1.52 0.12 0.25 -16.1

Approximate Full Adders

LOA-6 82.19 0.09 0.25 0.2

LOA-8 89.99 0.37 1.00 0.2

Truncated Adders

TruA-6 99.98 0.48 1.30 -63.0

TruA-8 100.0 1.95 5.40 -255.0

Note: The number following the name of each approximate adder is the number

of LSBs used for the carry speculation in the speculative adders, the length of

the segmentation in the segmented adders, and the number of approximated

and truncated LSBs in the approximate full adder-based and truncated adders.

Among these approximate adders, CSA is the most accurate, and GCSA is the sec-

ond most accurate in terms of MRED. LOA has a different structure from the other

approximate adders. Its more signiﬁcant part is fully accurate, while the approximate

part is less signiﬁcant. Therefore, the MRED of LOA is rather small, but its E R is

very large. For a similar reason, TruA has the highest ER and very large MRED. The

information used to predict each carry in ESA and CSPA is rather limited, so the ER

and the MRED of ESA and CSPA are larger than most of the other approximate de-

signs. Compared with the other approximate adders, CCA, ETAII, SCSA and ACAA

show moderate ER and M RED. In terms of average error, LOA has the lowest value

because it produces both positive and negative errors that can compensate each oth-

er, while errors are accumulated for the other approximate adders since only negative

errors are generated. Therefore, LOA is suitable for an accumulative operation.

In summary, the carry select adders and the speculative adder (ACA) are very accu-

rate with small values of ER and M RED (except for CSPA using a small number of

bits for carry prediction). Represented by LOA, an approximate full adder based adder

has a moderate MRED, the lowest average error but a very large ER. The segmented

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:9

0.18 0.68 4.26 5.85 5.85 5.85 8.71

29.82

49.83

89.99

100.00

0.00

20.00

40.00

60.00

80.00

100.00

120.00

CSA ACA GCSA ETAII ACAA SCSA CCA CSPA ESA LOA TruA

ER (%)

(a)

0.1

CSA GCSA LOA ACA CCA ETAII SCSA ACAA ESA TruA CSPA

MRED (%)

(b)

Fig. 8. A comparison of error characteristics of the approximate adders. (a) ER and (b)

MRED. The parameter, k, is 4 for CSA, GCSA, ETAII, ACAA, SCSA, CCA and CSPA,

and it is 8for ACA, ESA, LOA and TruA for an equivalent carry propagation chain.

adders are not very accurate in terms of NM ED and M RED. With very large values

of ER and M RED, the truncated adder is the least accurate among the equivalent

designs. Three different types of approximate adders, ETAII, ACAA and SCSA, have

the same error characteristics.

2.2.2. Circuit Characteristics. To assess the circuit characteristics, 16-bit approximate

adders and the accurate CLA are implemented in VHDL and synthesized using the

Synopsys Design Compiler (DC) based on an STM CMOS 28 nm process with a supply

voltage of 1.0 V at a temperature of 25◦C. For a fair comparison, all designs use the

same process, voltage and temperature with the same optimization option. Both the

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:10 H. Jiang et al.

200

400

600

800

1000

1200

Delay (ps)

(a)

0.00

20.00

40.00

60.00

80.00

100.00

120.00

140.00

160.00

180.00

200.00

Power (uW)

(b)

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

PDP (fJ)

(c)

Fig. 9. A comparison of delay and power of the approximate adders (a) delay, (b) power

and (c) power-delay product.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:11

Table II. Circuit Characteristics of the Approximate Adders

Adder Type Delay

(ps)

Power

(uW )

PDP

(fJ)

Area

(um2)

ADP

(um2.ns)

CLAC 1000 65.9 65.9 60.7 60.7

CLAG 570 105.4 60.1 84.2 48.0

Speculative Adders

ACA-4 250 118.4 29.6 73.8 18.5

ACA-5 270 119.4 32.2 71.8 19.4

Segmented Adders

ESA-4 260 47.0 12.2 49.9 13.0

ESA-5 310 50.6 15.7 51.7 16.0

ETAII-4 550 80.6 44.3 71.6 39.4

ETAII-5 670 78.5 52.6 70.2 47.0

ACAA-4 550 80.9 44.5 70.8 38.9

ACAA-5 650 87.3 56.8 74.6 48.5

Carry Select Adders

SCSA-4 320 134.5 43.0 109.2 34.9

SCSA-5 400 163.0 65.2 126.2 50.5

CSA-4 390 97.8 38.1 142.5 55.6

CSA-5 420 94.3 39.6 131.2 55.1

CSPA-4 300 89.2 26.8 83.7 25.1

CSPA-5 370 117.6 43.5 100.7 37.3

CCA-4 320 172.6 55.2 131.4 42.0

CCA-5 420 209.5 88.0 155.0 65.1

GCSA-4 380 109.7 41.7 74.3 28.2

GCSA-5 460 113.6 52.3 73.3 33.7

Approximate Full Adders

LOA-6 440 75.1 33.0 58.8 25.9

LOA-8 390 66.9 26.1 53.2 20.8

Truncated Adders

TruA-6 390 67.9 26.5 52.4 20.4

TruA-81350 64.2 22.5 46.2 16.2

1TruA-8 is synthesized at a medium mapping effort, different

from the high mapping effort used for the other designs. In

this case, TruA-8 attains a shorter critical path delay, but a

similar PDP and ADP as the results synthesized using the

high mapping effort.

P-channel and the N-channel transistors used in the designs have a typical design

corner with a regular threshold voltage. The critical path delay and area are reported

by the Synopsys DC. Power dissipation is measured by the PrimeTime-PX tool at 1 ns

clock period with 10 million random input combinations. All adders and sub-adders

are implemented as CLA in this article, unless otherwise noted.

Table II reports the results for the delay, power dissipation, power-delay product

(PDP), circuit area and area-delay product (ADP) of the considered adders. Two struc-

tures of the accurate CLA are implemented: CLAC is realized by four cascaded 4-bit

CLAs, while CLAG is realized by four parallel 4-bit CLAs and a carry look-ahead gen-

erator. Among ETAII, SCSA and ACAA (with the same error characteristics when the

same value of kis selected), SCSA, albeit being the fastest, incurs the largest power

dissipation and area because two sub-adders and one multiplexer are utilized in each

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:12 H. Jiang et al.

10 20 30 40 50 60 70 80 90 100

PDP (fJ)

10-2

10-1

100

101

102

ER (%)

LOA

ESA

ETAII

GCSA

CSA

CSPA

CCA

LOA

TruA

CSPA

CSA

GCSA

ACA

CCA

LOA

ESA

ETAII

TruA

(a)

0 10 20 30 40 50 60 70 80 90 100

PDP (fJ)

10-2

10-1

100

101

102

MRED (10-3)

ACA

ESA

ETAII

GCSA

CSA

CSPA

CCA

LOA

TruA

CCA

ACA

CSPA

ESA

ETAII

GCSA

LOA

CSA

TruA

(b)

Fig. 10. A comprehensive comparison of the approximate adders: (a) ER and PDP and

(b) MRED and PDP. The parameter kfor LOA and TruA ranges from 9 down to 3 from

left to right, it is 3 to 8 for ESA and ACA, and it is from 3 to 6 for the other adders from

left to right. The adders marked by circles are equivalent in terms of carry propagation

and are thus representatives of different designs.

block. ACAA is very slow due to its long critical path. The block of ETAII (a carry gen-

erator and a sum generator) is signiﬁcantly simpler than those of SCSA and ACAA.

Therefore, ETAII consumes less power and requires a smaller area than SCSA and

ACAA.

In general a circuit with a larger area is likely to consume more power except for CSA

with a relatively low power dissipation but a large area. This is due to its short critical

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:13

path and enhanced carry select that results in a complex wiring. Figure 9 shows the

delay, power and PDP of the equivalent adders. As expected, the accurate CLAC has

the longest delay among all adders, but not the highest power dissipation. Compared

to CLAC, CLAG is signiﬁcantly faster and consumes more power and area. TruA is not

very fast, but it is the most power and area efﬁcient design. LOA is also very power and

area efﬁcient compared with the other approximate adders. ESA is the slowest, but it

is very power and area efﬁcient due to its simple segmentation structure. CCA is very

fast but is the most power and area consuming design due to its complex speculative

circuit. Both CSPA and GCSA have moderate power dissipations, but CSPA is faster

and GCSA uses a smaller area. Both the speed and power dissipation of CSA are in the

medium range. In terms of PDP and ADP (shown in Table II), they show very similar

trend. TruA, LOA and CSPA have very small values of PDP and ADP, while these

values are relatively large for ACA and CCA (shown in Figure 9(c)).

In conclusion, the carry select adders are likely to have large values of power dis-

sipation and area at a moderate performance. The segmented adders are power and

area efﬁcient. A speculative adder is very fast, but it is also very power consuming

with a moderate area. Conversely, the approximate full adder based adder is slow, but

it consumes a low power and area. The approximate full adders are very efﬁcient in

PDP and ADP, while the speculative adders are not. The truncated adder is very power

and area efﬁcient, but with a relatively long delay.

2.2.3. Discussion. Since the ADP shows a similar trend as the PDP, the PDP is con-

sidered for a comprehensive comparison of the approximate adders, as shown in the

two-dimensional (2-D) plots of Figure 10. CSA-6is accurate due to the precise carry

generated for every block, so the ER and M RED of CSA-6 are 0. Therefore, they are

not shown in Figure 10. The equivalent adders are marked by circles. Among adders

with the same accuracy (ETAII, SCSA and ACAA), ETAII is the most efﬁcient in terms

of delay, power and area. Thus, it is shown as a representative in Figure 10. Compared

with the other approximate adders, CCA has the largest PDP and moderate ER and

MRED. Among the schemes with moderate PDPs (CSPA, GCSA and ETAII), ETAII

and GCSA have moderate MREDs and ERs, while CSPA shows slightly higher val-

ues of these measures. ESA has a rather small PDP, but a considerably large ER and

MRED. ACA has a larger PDP than ESA, but it has both lower ER and MRED. A-

mong all approximate adders, CSA shows the best performance with very small PDP,

ER and M RED values.

With the highest ERs, LOA and TruA show the smallest PDPs for a similar M RED

due to their low power dissipation. In fact, these approximate adders show a decent

tradeoff in error distance and hardware efﬁciency. In particular, they are useful in

applications in which hardware efﬁciency is of the utmost importance.

3. APPROXIMATE MULTIPLIERS

3.1. Classiﬁcation of Approximate Multipliers

Generally, a multiplier consists of stages of partial product generation, accumulation

and a ﬁnal addition, as shown in Figure 11 for a 4×4unsigned multiplication. Let Ai

and Bjbe the ith and jth least signiﬁcant bits of inputs Aand Brespectively, a partial

product Pj,i is usually generated by an AND gate (i.e., Pj,i =AiBj). The commonly

used partial product accumulation structures include the Wallace, Dadda trees and a

carry-save adder array. The Wallace tree for a 4×4unsigned multiplier is shown in

the dotted box of Figure 11. The adders in each layer operate in parallel without carry

propagation, and the same operation repeats until two rows of partial products are left.

For an n-bit multiplier, log(n)layers are required in a Wallace tree. Therefore, the delay

of the partial product accumulation stage is O(log(n)). Moreover, the adders in Figure

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:14 H. Jiang et al.

Multiplicand

Multiplier

Partial

products

Products

Wallace

tree

Partial products

generation

Partial products

reduction

Final addition

Fig. 11. The basic arithmetic process of a 4×4unsigned multiplication with possible

truncations to a limited width. : an input, a partial product or an output product; :

a truncated bit; : a full adder or a half adder.

FAFAFA

FAFA

0,0

0,1

0,2

0,3

1,0

1,1

1,2

1,3

2,0

2,1

2,2

2,3

3,0

3,1

3,2

3,3

0,0

0,1

0,2

0,1

0,2

2,0

2,1

2,2

0,0

2,0

3,0

3,1

3,2

3,1

3,2

3,0

HAHAHA HA

0,3

1,0

1,1

1,2

1,1

1,2

1,0

1,3

2,3

3,3

‘0’‘0’‘0’‘0’

Fig. 12. Partial product accumulation of a 4×4unsigned multiplier using a carry-save

adder array. HA: a half adder; FA: a full adder.

11 can be considered as a (3:2) compressor and can be replaced by other counters or

compressors (e.g. a (4:2) compressor) to further reduce the delay. The Dadda tree has

a similar structure as the Wallace tree, but it uses as few adders as possible.

A carry-save adder array is shown in Figure 12; the carry and sum signals generated

by the adders in a row are passed on to the adders in the next row. Adders in a column

operate in series. Hence the partial product accumulation delay of an n-bit multiplier

is approximately O(n), longer than that of the Wallace tree. However, an array requires

a smaller area due to the simple and symmetric structure.

Three main methodologies are used for the approximate design of a multiplier: i)

approximation in generating the partial products [Kulkarni et al. 2011], ii) approxi-

mation (including truncation) in the partial product tree [Mahdiani et al. 2010; Kyaw

et al. 2010; Bhardwaj et al. 2014], and iii) using approximate designs of adders [Liu

et al. 2014], counters [Lin and Lin 2013] or compressors [Ma et al. 2013; Momeni et al.

2015] to accumulate the partial products. For a signed integer operation, Booth multi-

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:15

Table III. K-Map for the 2×2Underdesigned Multiplier Block.

M2M1M0

B1B0

00 01 11 10

A1A0

00 000 000 000 000

01 000 001 011 010

11 000 011 111 110

10 000 010 110 100

pliers have been widely used due to the fast operation on a reduced number of partial

products. Some recent designs use shifting and addition to obtain the ﬁnal product

by rounding the inputs to a form of 2m(mis a positive integer) [Hashemi et al. 2015;

Zendegani et al. 2017; Mitchell Jr 1962].

Based on the different schemes in approximation, approximate multipliers are clas-

siﬁed into three unsigned types and signed Booth multipliers. Following this classiﬁ-

cation, existing designs of approximate multipliers are brieﬂy reviewed next.

3.1.1. Approximation in generating partial products. The underdesigned multiplier (UDM)

utilizes an approximate 2×2multiplier obtained by altering a single entry in the Kar-

naugh Map (K-Map) of its function (as highlighted in Table III) [Kulkarni et al. 2011].

Table III shows the K-Map of the approximate 2×2multiplier, where A1A0and B1B0

are the two 2-bit inputs, and M2M1M0is the 3-bit output. In this approximation, the

accurate multiplication result “1001” is simpliﬁed to “111” to save one output bit when

both the inputs are “11”. Assuming the value of each input bit is equally likely, the

error rate of the 2×2multiplier is then (1

2)4=1

16 . Larger multipliers can be designed

based on the 2×2multiplier. This multiplier introduces an error when generating the

partial products, however the adder tree remains accurate.

3.1.2. Approximation in the partial product tree. A bio-inspired imprecise multiplier re-

ferred to as a broken-array multiplier (BAM) is proposed in [Mahdiani et al. 2010]. The

BAM operates by omitting some carry-save adders in an array multiplier in both hori-

zontal and vertical directions (Figure 13). A more straightforward approach to trunca-

tion is to truncate some LSBs on the input operands and thus, a smaller multiplier is

sufﬁcient for the remaining MSBs. This truncated multiplier (TruM) is considered as

a baseline design.

The error tolerant multiplier (ETM) is divided into a multiplication section for the

MSBs and a non-multiplication section for the LSBs [Kyaw et al. 2010]. Figure 14

shows the architecture of a 16-bit ETM. A NOR gate based control block is used to

deal with the following two cases: i) if the product of the MSBs is zero, then the upper

accurate 8-bit multiplier is activated to multiply the LSBs without any approximation,

and ii) if the product of the MSBs is nonzero, the non-multiplication section is used

as an approximate multiplier to process the LSBs, while the multiplication section is

activated to accurately multiply the MSBs.

The static segment multiplier (SSM) was proposed using a similar partition scheme

[Narayanamoorthy et al. 2015]. Different from ETM, no approximation is applied to

the LSBs in the SSM. Either the MSBs or the LSBs of the operands are accurately

multiplied depending on whether its MSBs are all zeros. [Liu et al. 2017a] has shown

that only small improvements in accuracy and hardware are achieved compared to

ETM, thus this design is not further considered in this comparison study.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:16 H. Jiang et al.

Vertical Break

Level (VBL)

Horizontal Break

Level (VBL)

Horizontally-omitted cell Vertically-omitted cell

Fig. 13. The broken-array multiplier (BAM) with 4 vertical lines and 2 horizontal lines

omitted [Mahdiani et al. 2010]. : a carry-save adder cell.

Standard 8-bit Multiplier

LSB Non-multip lication

block

Standard 8-bit Multiplier

-A

-B

-P

-A

-B

-A

-B

-P

Control Block

-A

-B

Fig. 14. The 16-bit error-tolerant multiplier (ETM) of [Kyaw et al. 2010].

A power and area-efﬁcient approximate Wallace tree multiplier (AWTM) is based

on a bit-width aware approximate multiplication and a carry-in prediction method [B-

hardwaj et al. 2014]. An n-bit AWTM is implemented by four n/2-bit sub-multipliers,

as shown in Figure 15, where the most signiﬁcant sub-multiplier AHBHis further im-

plemented by four n/4-bit sub-multipliers. The AWTM is conﬁgured into four different

modes by the number of approximate n/4-bit sub-multipliers in the most signiﬁcant

n/2-bit sub-multiplier, while the other three multipliers (AHBL,ALBHand ALBL) are

approximate. The approximate partial products are then accumulated by a Wallace

tree.

3.1.3. Using approximate counters or compressors in the partial product tree. An approximate

(4:2) counter is proposed in [Lin and Lin 2013] for an inaccurate 4-bit Wallace multi-

plier. Table IV shows the K-Map of the approximate (4:2) counter, where X1· · · X4are

the four input signals of a (4:2) counter (i.e., the partial products in the partial product

tree of a multiplier), Cand Sare the carry and sum, respectively. The values of CS in

the box are approximated as “10” for “100” in the approximate counter when all input

signals are “1.” As the probability of obtaining a partial product of “1” is 1

4, the error

rate of the approximate (4:2) counter is (1

4)4=1

256 . The inaccurate 4-bit multiplier is

then used to construct larger multipliers with error detection and correction circuits.

This approximate multiplier is referred to as ICM.

In the compressor based multiplier, accurate (3:2) and (4:2) compressors are im-

proved to speed up the partial product accumulation [Baran et al. 2010]. By using the

improved compressors, better energy and delay characteristics are obtained for a mul-

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:17

Final Product (2n b its)

n bits

n/2 bits

n/4 bits

Fig. 15. The approximate Wallace tree multiplier (AWTM) [Bhardwaj et al. 2014]. n

is the width of the multiplier, AHBH,ALBH,AHBLand ALBLare partial products

generated by the n/2-bit sub-multipliers, AHH BH H ,AHLBHH ,AH H BHL and AH LBH L

are partial products generated by the n/4-bit sub-multipliers.

Table IV. K-Map for the 4 : 2 Approximate Counter

CS X1X0

00 01 11 10

X3X4

00 00 01 10 01

01 01 10 11 10

11 10 11 10 11

10 01 10 11 10

tiplier. To further reduce delay and power, two approximate (4:2) compressor designs

(AC1 and AC2) are presented in [Momeni et al. 2015]; these compressors are used in

a Dadda multiplier with four different schemes. Approximate counters in which the

more signiﬁcant output bits are ignored, are presented and evaluated in [Kelly et al.

2009]. Several signed multipliers are then implemented using these approximate coun-

ters. The more accurate schemes 3 and 4 of the approximate compressor based mul-

tiplier (referred to as ACM-3 and ACM-4) in [Momeni et al. 2015] are considered for

comparison.

In the approximate multiplier with conﬁgurable error recovery, the partial product-

s are accumulated by a novel approximate adder (Figure 16) [Liu et al. 2014]. The

approximate adder utilizes two adjacent inputs to generate a sum and an error bit.

The adder processes data in parallel, thus no carry propagation is required. Two ap-

proximate error accumulation schemes are then proposed to alleviate the error of the

approximate multiplier (due to the approximate adder). OR gates are used in the ﬁrst

error accumulation stage in scheme 1 (AM1), while in scheme 2 (AM2), both OR gates

and the approximate adders are used. The truncation of 16 LSBs in the partial prod-

ucts in AM1 and AM2 results in TAM1 and TAM2 respectively [Liu 2014].

3.1.4. Approximate Booth multipliers. The Booth recoding algorithm handles binary num-

bers in 2’s complement. The modiﬁed (or radix-4) Booth algorithm is commonly used

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:18 H. Jiang et al.

Fig. 16. The approximate adder cell. Si: the sum bit; Ei: the error bit [Liu et al. 2014].

0,0

0,2

0,4

0,6

0,1

0,3

0,5

0,7

0,8

1,0

1,2

1,4

1,6

1,1

1,3

1,5

1,7

1,8

2,0

2,2

2,4

2,6

2,1

2,3

2,5

2,7

2,8

3,0

3,2

3,4

3,6

3,1

3,3

3,5

3,7

3,8

Main Part (MP) Truncation Part (TP)

major

minor

exact

carry

approximate

carry

Fig. 17. The partial products for an 8-bit ﬁxed-width modiﬁed Booth multiplier [Cho

et al. 2004]. Pi,j is the jth partial product in the ith partial product vector and niis the

sign of the ith partial product vector.

due to its ease in generating partial products. Little work has been reported for

approximate Booth multipliers, whereas the ﬁxed-width Booth multiplier utilizes a

truncation-based approach has been studied for more than a decade. The conventional

post-truncated ﬁxed-width multiplier generates an output with the same width as the

input operand by truncating the lower half of the product. Truncation of half of the

partial products is widely used because the post-truncated scheme does not achieve a

signiﬁcant circuit advantage over the accurate multiplier. A direct truncation of partial

products incurs a large error, so many error compensation schemes have been proposed

[Cho et al. 2004; Min-An et al. 2007; Wang et al. 2011; Chen and Chang 2012]. Anoth-

er approach is to use an approximate Booth encoder with a simple circuit [Liu et al.

2017b]. Most of the approximate Booth multipliers are based on the modiﬁed Booth

algorithm; the partial products are accumulated by an array structure in [Cho et al.

2004; Min-An et al. 2007; Wang et al. 2011; Farshchi et al. 2013] while a parallel carry-

save-adder tree is used in [Chen and Chang 2012; Liu et al. 2017b]. The approximate

Booth multiplier in [Jiang et al. 2016a] is based on the radix-8 Booth algorithm.

Figure 17 shows the partial products of an 8-bit ﬁxed-width modiﬁed Booth multi-

plier with error compensation [Cho et al. 2004]. The ﬁnal product is the addition of the

main part (MP) and the carry signals generated in the truncation part (TP). The carry

signals are approximated by the output of Booth encoders. The approximate carry σis

σ=$2−1(

n/2−2

i=0

zeroi+ 1)%, where nis the multiplier width, and zeroiis “1” if the ith

partial product vector is not zero or zeroi= 0 otherwise. This multiplier is referred to

as BM04.

The multiplier in [Min-An et al. 2007] can adaptively compensate the quantization

error by keeping different numbers of the most signiﬁcant columns of the partial prod-

ucts (ω(ω≥0)). Two types of binary thresholding are proposed for error compensation.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:19

Different from BM04, nrather than (n−1) columns of partial products are truncated

for an n-bit multiplier. The error compensation for each type of binary thresholding

varies with the value of ωand the partial products of the ωth column in the truncation

part (from left to right). This multiplier is denoted as BM07.

The multiplier presented in [Wang et al. 2011] uses ncolumns of partial products

in the truncation part for an n-bit multiplier; the most signiﬁcant one column in the

truncation part is reserved for error compensation. The error compensation using a

simpliﬁed sorting network signiﬁcantly reduces the mean and mean-squared errors by

making the error symmetric as well as centralizing the error distribution around zero.

This design is referred to as BM11.

A ﬁxed-width Booth multiplier was designed based on a probabilistic estimation

bias in [Chen and Chang 2012]. Therefore, this multiplier is referred to as PEBM.

The number of columns of the accumulated partial products varies in accordance

with the desired trade-off between hardware and accuracy. The error compensation

formula is derived from a probability analysis rather than a time-consuming ex-

haustive simulation. The carry generated by the truncation part is approximated by

σ=$2−1(

n/2−1−bω/2c

i=0

zi−1)%, where zi=P0,n/2−1+nn/2−1when iis n/2−1and

zi=zeroiotherwise.

Based on BAM, the broken Booth multiplier (BBM) uses a modiﬁed Booth algorithm

to generate partial products and omits carry-save adders to the right of a vertical line

[Farshchi et al. 2013]. BBM has a smaller power-delay product (PDP) for the same

mean-squared error compared to BAM.

An approximate recoding adder is proposed in [Jiang et al. 2016a] for calculating

the triple multiplicands to reduce the additional delay encountered in a radix-8 Booth

multiplier. A Wallace tree and a truncation technique are then utilized for partial prod-

uct accumulation to reduce power and delay. The most efﬁcient ﬁxed-width multiplier

ABM2_R15 is considered in this comparison and referred to as ABM2 in this article for

simplicity. In [Liu et al. 2017b], two approximate Radix-4 Booth encoders are designed

for the partial product generation by simplifying the exact K-Map. The generated par-

tial products are then accumulated by using exact 4-2 compressors.

3.2. Evaluation of Approximate Multipliers

3.2.1. Error Characteristics. The considered (16 ×16) approximate multipliers are simu-

lated by MATLAB with 10 million uniformly distributed random input combinations.

The ER,N MED,M RED and average error are obtained and shown in Table V. TruM-

krepresents the truncated multiplier with kLSBs truncated in the input operands.

According to Table V, most of the designs, especially those with truncation, have

large ERs close to 100%. However, ICM has a relatively low ER of 5.45%, because it

uses just one approximate counter in a 4×4sub-multiplier with an error rate of only

256 . UDM also shows a lower ER than the other approximate multipliers. In terms of

the average error, ACMs have the smallest value, while the average errors for all the

other approximate unsigned multipliers show the same trend with the NM ED. This

is because ACMs produce both positive and negative errors, but the other approximate

unsigned multipliers produce either negative or positive errors.

Figure 18 shows the NMEDs and M RE Ds of the equivalent approximate multipli-

ers that are conﬁgured to have 16-bit accurate MSBs (except for ICM and UDM that

have only one conﬁguration). Thus, the truncated LSBs in the partial product is 16 for

BAM, the number of MSBs used for error compensation is 16 for AM1, AM2, TAM1

and TAM2, the size of the accurate sub-multiplier is 8 for ETM, 8 LSBs are truncat-

ed for TruM, and the mode number of ACM and AWTM is 4. Among the unsigned

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:20 H. Jiang et al.

Table V. Error Characteristics of the Approximate Multipliers

Multiplier Type ER (%) N MED (10−3)MRED (%) Average Error (104)

Multipliers with Approximation in Generating Partial Products

UDM 80.99 13.92 3.33 -5974.9

Multipliers with Approximation in the Partial Product Tree

BAM-16 99.99 0.06 0.21 -24.6

BAM-17 99.99 0.11 0.36 -49.2

BAM-18 99.99 0.22 0.63 -95.0

BAM-20 100.00 0.79 1.79 -337.5

ETM-7 99.99 0.97 1.56 -413.4

ETM-8 100.00 1.94 2.85 -825.1

AWTM-1 100.00 2.70 75.00 1159.6

AWTM-2 100.00 1.88 39.37 807.4

AWTM-3 99.99 0.12 2.51 51.5

AWTM-4 99.94 0.02 0.33 8.6

Multipliers using Approximate Counters or Compressors

ICM 5.45 0.29 0.06 -124.2

ACM-3 99.99 0.01 0.29 3.95

ACM-4 99.97 0.01 0.26 1.44

AM1-13 99.38 0.90 0.54 -385.5

AM1-16 98.22 0.81 0.34 -347.9

AM2-10 99.64 0.88 1.20 -379.8

AM2-13 99.36 0.35 0.34 -148.9

AM2-16 97.96 0.27 0.13 -115.1

TAM1-13 99.99 1.14 0.77 -488.5

TAM1-16 99.99 1.06 0.58 -457.1

TAM2-10 99.99 0.90 1.27 -384.2

TAM2-13 99.99 0.36 0.41 -153.3

TAM2-16 97.99 0.28 0.22 -121.0

Truncated Unsigned Multipliers

TruM-4 99.61 0.11 0.23 -49.2

TruM-8 100.0 1.94 2.85 -834.1

Approximate Booth Multipliers

PEBM 99.99 0.023 0.27 -1.02

BBM 100.00 0.092 0.57 -9.83

BM11 99.99 0.022 0.18 -0.003

BM07 99.99 0.024 0.16 -1.56

BM04 99.99 0.027 0.48 -2.66

ABM2 99.99 0.034 0.44 -0.614

Note: The parameter kfollows the acronym of each approximate multiplier. For AM1,

AM2, TAM1 and TAM2, this parameter refers to the number of MSBs used for error

reduction and for ETM, the number of LSBs in the inaccurate part. It is the mode

number in AWTM and ACM, and the vertical broken length (VBL) for BAM.

approximate multipliers, UDM has the largest NMED while ACM has the smallest.

ICM, AM2 and TAM2 have similar values of NMED, however ICM has the small-

est MRED, while the M RED of TAM2 is the largest. Therefore ICM has the highest

accuracy in terms of MRED, while TAM2 is the least accurate among these three ap-

proximate multipliers. This indicates that multipliers with simple truncation tend to

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:21

0.0001

0.001

0.01

0.1

NMED (%)

(a)

0.01

0.1

MRED (%)

(b)

Fig. 18. Comparison of NM ED and MRED of the approximate multipliers with in-

creasing (a) NM ED and (b) M RED. ACM and AWTM represent ACM-4 and AWTM-4,

respectively. The truncated number of LSBs in the partial product is 16 for BAM, the

number of MSBs used for error compensation is 16 for AM1, AM2, TAM1 and TAM2,

and 8 LSBs are truncated for TruM. ETM is ETM-8 in Table V.

have larger MREDs when their N M E Ds are similar. BAM has moderate values of

NM ED and M RED, while ETM and TruM have both large M RE D and N M ED.

Hence, ICM is the most accurate design with the lowest ER,M RED and a moder-

ate NM ED. ACM, AWTM, BAM, AM2 and TAM2 also show good accuracy among all

considered approximate multipliers with both low NM EDs and M REDs. ETM, TruM

and UDM are not very accurate in terms of these metrics.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:22 H. Jiang et al.

For the approximate Booth multipliers in Table V, a column of the most signiﬁcant

partial products in the truncation part (adjacent to the MP part) is kept for PEBM,

BM07 and BM04. 15-bit columns of partial products are truncated in BBM and ABM2

to keep the same width of the output as the other designs. The ERs of all approximate

multipliers are close to 100% due to truncation. Most designs have similar NMEDs

except for BBM. BBM has the largest NM ED and M RED, because there is no error

compensation. BM07 and BM11 have very small MRED values, while PEBM has a

slightly larger value. BM11 has the lowest average error and the average error of

ABM2 is also very small.

In summary, as a multiplier approximated in generating the partial products, UDM

has very large values of NM ED and M RED, and a relatively small ER. The multipli-

ers approximated in the partial product tree mostly have moderate NMEDs and very

large MREDs (except for BAMs with fewer than or equal to 18 truncated bits and

AWTM-4). The multipliers approximated using approximate counters or compressors

have small values of both NMED and MRED, while the multiplier truncated on the

input operands have large values of both metrics (when the truncated number of LSBs

is larger than 4). Among the considered approximate Booth multipliers, BBM shows

the lowest accuracy in terms of both NM ED and M RED. BM11 has the smallest aver-

age error. The other approximate Booth multipliers show similar N MEDs and various

MREDs.

3.2.2. Circuit Characteristics. The 16 ×16 approximate multipliers are implemented in

VHDL and synthesized using the same tool and process as in the simulation of ap-

proximate adders. The only difference is that the clock period is 4 ns for the power es-

timation of the multipliers because of a longer critical path delay. The accurate Wallace

multiplier (WallaceM) optimized for speed [Oklobdzija et al. 1996] and array multiplier

(ArrayM) are also simulated for comparison. To reduce the effect of the ﬁnal addition,

the same multi-bit adder in the tool library is utilized in all approximate multiplier

designs as the ﬁnal adder. Table VI shows the critical path delay, area, power, PDP

and ADP of the considered multipliers. TruMA and TruMW are the truncated array

and Wallace multipliers, respectively.

Figure 19 shows the comparison of delay, power and PDP of the equivalent approxi-

mate multipliers. The accurate array multiplier (ArrayM) is the slowest and the Wal-

lace multiplier (WallaceM) consumes more area (as per Table VI); this is consistent

with the theoretical analysis. Due to the expressively fast carry-ignored operation,

AM1/TAM1, AM2/TAM2 have smaller delays compared to most of the other designs.

BAM is signiﬁcantly slower due to its array structure. AWTM, UDM, ICM and ACM

have larger delays than the other approximate multipliers. BAM consumes a very low

power, the power consumptions of AWTM and ACM are in the medium range, while

UDM and ICM incur a relatively high power consumption. TruMA, TruMW and ETM

have both a short delay and a low power dissipation.

A multiplier with a higher power dissipation usually has a larger area and thus larg-

er PDP and ADP. In terms of power and area, TruMA, TruMW, ETM, TAM1/TAM2 and

BAM are among the best designs. A common feature of these designs is that they all

use truncation, which can signiﬁcantly affect the MRED while the NM ED may not

be signiﬁcantly changed. If most of the inputs have large values, the error introduced

by truncation can be tolerated; thus truncation is a useful scheme to save area and

power. Otherwise, truncation-based designs may yield unacceptably inaccurate result-

s. Without truncation, a multiplier whose design is approximated in generating partial

product (e.g. UDM) tends to have a large delay, power and area. These measures for

the multipliers approximated in the partial product tree (e.g. AWTM) are moderate,

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:23

Table VI. Circuit Characteristics of the Approximate Multipliers

Multiplier Type Delay

(ns)

Power

(uW )

PDP

(fJ)

Area

(um2)

ADP

(um2.ns)

ArrayM 2.58 477.4 1,231.7 921 2,375.7

WallaceM 2.03 461.3 936.4 934 1,896.0

Multipliers with Approximation in Generating Partial Products

UDM 2.01 352.7 708.9 829 1666.7

Multipliers with Approximation in the Partial Product Tree

BAM-16 2.34 221.3 517.8 441 1,031.9

BAM-17 2.17 189.5 411.2 384 833.3

BAM-18 1.99 161.1 320.6 331 658.7

BAM-20 1.65 111.0 183.2 237 390.4

ETM-7 1.57 140.5 220.6 349 547.6

ETM-8 1.50 108.5 162.8 288 431.4

AWTM-1 1.69 247.8 418.8 640 1,081.8

AWTM-2 1.69 259.4 438.4 665 1,123.7

AWTM-3 1.69 270.3 456.8 690 1,165.6

AWTM-4 1.74 280.0 478.2 715 1,243.2

Multipliers using Approximate Counters or Compressors

ICM 1.87 367.4 687.0 937 1,751.4

ACM-3 1.97 279.5 550.6 738 1,454.5

ACM-4 2.00 284.1 568.2 724 1,447.0

AM1-13 1.38 355.4 490.5 819 1,128.8

AM1-16 1.57 380.6 597.5 878 1,378.5

AM2-10 1.29 336.8 434.5 816 1,052.6

AM2-13 1.53 364.2 557.2 919 1,406.1

AM2-16 1.71 400.4 684.7 1,051 1,797.2

TAM1-13 1.31 192.0 251.5 460 602.6

TAM1-16 1.45 214.6 311.2 516 748.2

TAM2-10 1.23 180.2 221.6 477 586.7

TAM2-13 1.48 212.5 314.5 584 864.3

TAM2-16 1.62 244.9 396.7 693 1,122.7

Truncated Unsigned Multipliers

TruMA-4 1.89 243.5 460.2 503 950.6

TruMA-8 1.19 92.1 109.6 211 250.5

TruMW-4 1.62 262.4 425.1 561 908.2

TruMW-8 1.10 98.4 108.2 239 263.0

Approximate Booth Multipliers

PEBM 1.83 264.3 483.7 528 966.2

BBM 1.91 250.3 478.1 487 930.2

BM11 1.96 258.1 505.9 475 931.0

BM07 2.03 270.4 548.9 528 1071.8

BM04 2.05 249.8 512.1 447 916.4

ABM2 2.25 208.0 468.0 424 954.0

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:24 H. Jiang et al.

0.00

0.50

1.00

1.50

2.00

2.50

3.00

Delay (ns)

(a)

0.00

100.00

200.00

300.00

400.00

500.00

600.00

Power (uW)

(b)

200

400

600

800

1000

1200

1400

PDP (fJ)

(c)

Fig. 19. A comparison of delay, power and PDP of the approximate multipliers. (a)

delay, (b) power and (c) power-delay product.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:25

100 200 300 400 500 600 700 800

PDP (fJ)

10-1

100

MRED (%)

AM1-16

AM1-13

TAM1

AM2-16

AM2-13

AM2-10

TAM2

ETM-8

ETM-7

AWTM-4

AWTM-3

ICM

UDM

BAM

ACM-3

ACM-4

TruMA

TruMW

Fig. 20. MRED and PDP of the approximate unsigned multipliers. The parameter k

for TruMA and TruMW is from 8 down to 2 from left to right; it is 21 down to 13

for BAM, and 10 to 16 for TAM1 and TAM2. The multipliers marked by circles are

equivalent in terms of the number of accurate MSBs and are thus representatives of

different designs.

while the multipliers using approximate counters or compressors (ICM, ACM, AM1,

AM2) require higher power and area.

In terms of PDP (Figure 19(c)) and ADP, TruMA, TruMW, ETM, TAM1 and TAM2

have very small values, while ICM, UDM and AM2 are the opposite. The values of PDP

and ADP for AM1, ACM, BAM and AWTM are in the medium range.

For the approximate Booth multipliers, PEBM is the fastest, but it is very power and

area consuming due to the use of a carry save adder tree for the parallel accumulation.

ABM2 is the slowest but the most power and area efﬁcient design due to the smaller

number of partial products generated by the time-consuming recoding adder in the

radix-8 algorithm. Thus, it has the smallest PDP and a moderate value of ADP. With

no error compensation, BBM shows a small delay, low power dissipation and small

circuit area, and thus smaller PDP and ADP compared with most of the other designs

(except ABM2 and BM04). BM11 and BM04 have similar values for all circuit metrics.

BM07 has a similar delay, but with a higher power and area and thus a larger PDP

and ADP.

3.2.3. Discussion. MRED and PDP are jointly considered next for an overall evalua-

tion of the approximate multipliers, as shown in Figure 20 and Figure 21.

Figure 20 shows that TruMW has a smaller PDP than TruMA when the same num-

ber of LSBs is truncated. Among the truncation-based designs, the truncated unsigned

multipliers (TruMA and TruMW) are slightly more accurate (in MRED) than BAM

and ETM for a similar PDP. TruMW has a smaller M RED than most other approxi-

mate designs (except TAM1, TAM2 and ICM).

In Figure 20, TAM1-13, TAM1-16, TAM2-13, TruMA-6, TruMW-6 and BAM-18 have

both small PDPs and MREDs. Most of the other designs have at least one major short-

coming. ICM and ACM incur a very low error, but their PDPs are very high. Other than

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:26 H. Jiang et al.

0.44 0.46 0.48 0.5 0.52 0.54 0.56 0.58

PDP (pJ)

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65

MRED (%)

BM04

BM11

ABM2

PEBM

BM07

BBM

Fig. 21. MRED and PDP of the approximate Booth multipliers.

the truncated designs, ETM-8 has the smallest PDP but with a rather large MRED.

UDM shows a poor performance in both PDP and MRED. Even though some BAM

conﬁgurations have small PDPs, their delays are generally large (Figure 19(a)); more-

over, some BAM conﬁgurations have low accuracies. AWTMs have large PDPs and only

AWTM-4 has a high accuracy.

As for the approximate Booth multipliers (Figure 21), ABM2 shows the lowest PDP

and a moderate accuracy. BM11 and BM07 are very accurate in terms of M RED but

with relatively poor PDPs. PEBM shows both a moderate PDP and MRED.

4. APPROXIMATE DIVIDERS

4.1. Classiﬁcation

The divider is a less frequently used arithmetic module compared to the adder and

multiplier; therefore, less research has been pursued on an approximate design.

Two methodologies have been advocated for sequential division, the digit recurren-

t algorithm [Liu and Nannarelli 2012] and the functional iterative algorithm (e.g.,

using the Newton-Raphson algorithm [Flynn 1970]). A sequential divider has a low

hardware complexity, however its delay is considerably longer than an adder and a

multiplier, so it signiﬁcantly affects the overall performance of a processor. Thus, di-

viders made of combinational logic circuits are discussed in this article. Like multipli-

cation, division can also be implemented by an array structure, in which adder cells

are replaced by subtractor cells. Several approximations are made on the array divider

while retaining a low-power [Chen et al. 2015; 2016; Chen et al. 2017] and high-speed

[Hashemi et al. 2016] operation. In addition, different approximate divider designs

based on rounding [Zendegani et al. 2016] and curve ﬁtting [Low and Jong 2013; Wu

and Jong 2015] are also proposed.

4.1.1. Approximate array dividers. Four types of approximate unsigned integer non-

restoring divider (AXDnr) are presented in [Chen et al. 2015]. Three approximate

subtractors (AXCSs) are designed for the array of an unsigned divider by simplifying

the circuit of an exact subtractor cell. The AXCSs are then used to replace the exact

subtractor cells at the least signiﬁcant vertical, horizontal, square or triangle cells of

the array divider. Moreover, a truncation scheme is utilized by discarding the approx-

imate subtractors for comparison. Based on the same theory and design, four types

of approximate restoring dividers (AXDr) are further proposed by using the proposed

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:27

AXCSs [Chen et al. 2016]. It has been shown that AXDrs are slightly more accurate

and consumes lower power than AXDnrs.

To make the remaining subtractor cells more efﬁcient, a dynamic approximate di-

vider (DAXD) is proposed by dynamically selecting the inputs of the subtractor cells

[Hashemi et al. 2016]. DAXD selects a ﬁxed number of bits in the input operands from

the most signiﬁcant non-zero bit, and then truncates the least signiﬁcant bits. There-

fore, it can be implemented by two leading-one detectors, two multiplexers, a smaller

array divider and a barrel shifter.

To further improve the performance and power efﬁciency, an approximate signed-

digit adder is proposed for high-radix dividers [Chen et al. 2017]. Compared to the con-

ventional radix-2 design, the approximate radix-4 and radix-8 dividers show a higher

speed and consume lower power, albeit with a slightly lower accuracy.

4.1.2. Curve ﬁtting based approximate dividers. A widely used methodology to reduce the

hardware overhead of a divider is based on binary logarithms, that is to obtain the

antilogarithmic value of the difference between the logarithmic values of the dividend

and the divisor . Mitchell et al. ﬁrst developed the logarithm approximation of a binary

number by shifting and counting [Mitchell Jr 1962]. Inspired by it, a new antilogarith-

mic algorithm was proposed using a piecewise linear approximation [Low and Jong

2013]. A high-speed divider (HSD) based on this algorithm was then presented. The

antilogarithmic algorithm is directly approximated from the input operands, thus on-

ly look-up tables and multiplications are required, i.e., no logarithmic or subtraction

operation is needed. HSD achieves a better accuracy and a much higher speed (but at

a larger area) than the divider implemented directly by Mitchell’s algorithm.

A similar curve ﬁtting approach was used for the design of a ﬂoating-point divider

(FPD) [Wu and Jong 2015]. FPD partitions the curved surfaces of the quotient into

several square or triangular regions and linearly approximates each region by curve

ﬁtting. Finally, the division of the mantissas is implemented by a comparison module, a

look-up table, shifters and adders. This approximate divider achieves a better accuracy

than that in [Low and Jong 2013] with similar circuit characteristics.

4.1.3. Rounding based approximate dividers. A high-speed and energy-efﬁcient rounding-

based approximate divider (SEERAD) is presented in [Zendegani et al. 2016]. It trans-

forms the division to a smaller multiplication by rounding the divisor Bto a form of

2K+L/D, where Kshows the bit position of the most signiﬁcant “1” of B(K=blog2Bc),

and Land Dare constant integers found from an exhaustive simulation by the con-

dition of obtaining the lowest mean relative error. Different accuracy levels are con-

sidered to improve the accuracy of SEERAD by varying Dand Lwith combinations of

the more relevant bits of Bafter the most signiﬁcant “1”. The multiplier in SEERAD

is implemented by several shift units and an adder block. Thus, SEERAD is very fast.

4.2. Discussion

The approximate array dividers, AXDnr, AXDr and DAXD, are designed for the un-

signed n/(n

2)division, where nis the width of the dividend. As the borrow signal pass-

es through all subtractor cells, the critical path of an array divider is O(n2)[Parhami

2000]. Thus, the speed of the approximate array dividers is not very fast, but the area

and power dissipation are relatively low among the approximate designs. The accuracy

of the array dividers depends on the number of replaced cells (AXDnr and AXDr) and

the size of the sub-divider (DAXD).

The approximate dividers based on curve ﬁtting are very accurate, e.g, the maximum

relative error distance of a 16/16 HSD and FPD are 0.20% and 0.14%, respectively.

Although the curve ﬁtting using software is complex, the hardware implementation

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:28 H. Jiang et al.

consisting of look-up tables, smaller multipliers and adders is very simple. Thus, curve

ﬁtting based dividers are usually fast. However, the look-up tables used for storing the

constant coefﬁcients make the approximate dividers area and power consuming.

The rounding based approximate divider has a maximum relative error distance of

6.25% (for the design with the highest accuracy level) in a 16/16 divider and hence, it

is not very accurate. Its hardware implementation is simpler than those of the curve

ﬁtting based dividers. Therefore, the rounding based approximate divider has a very

high speed and relatively large area and power dissipation due to the use of look-up

tables.

5. IMAGE PROCESSING APPLICATION

Low power dissipation and small circuit area are basic requirements for consumer

electronic products, especially for mobile devices with stringent battery restrictions.

Therefore, approximate designs have been pervasively considered for implementation-

s in image processing. Approximate multipliers have been utilized for image sharp-

ening [Jiang et al. 2016b]. In this article, both the adder and multiplier in the image

sharpening algorithm are replaced by approximate designs. Moreover, approximate di-

viders are used to detect the difference between two images to show the changes. This

application is known as change detection.

5.1. Image Sharpening

The image sharpening algorithm computes R(x, y)=2I(x, y)−S(x, y)[Lau et al. 2009],

where Iis the input image, Ris the sharpened image, and Sis given by

S(x, y) = 1

4368

m=−2

n=−2

G(m+ 3, n + 3)I(x−m, y −n),(1)

where G is a 5×5matrix given by

G=





16 64 112 64 16

64 256 416 256 64

112 416 656 416 112

64 256 416 256 64

16 64 112 64 16







.(2)

Simulation results in [Jiang et al. 2016b] show that AM2-15, AM1-15, TAM2-16,

TAM1-16, BAM-16, AM2-13, AM1-13, ACM-4, ACM-3, TAM2-13, TAM1-13, BAM-17,

AWTM-4 and BAM-18 achieve visually acceptable image sharpening results. Among

these multipliers, the ones with a moderate hardware overhead (AM1-13, TAM2-13,

TAM1-16, TAM1-13, BAM-17 and BAM-18) are selected in this article for image sharp-

ening. Likewise, the approximate adders LOA, CSA, ETAII and CSPA are selected.

As the multiplication result of a 16 ×16 multiplier is 32-bit wide, 32-bit approximate

adders are used for image sharpening. The value of parameter kis 8 for CSA, ETAII

and CSPA, and 16 for LOA.

The results for image sharpening using the selected approximate multipliers and

adders are given in Table VII, while the accurate result is shown in Figure 22. The

images sharpened using CSPA have unacceptable defects and some defects (white dots)

can be seen in the image sharpened by AM1-13 and ETAII-8 when zooming into the

images in Table VII. Other images processed by the approximate designs show similar

quality with the accurate result. This is also conﬁrmed by the peak signal-to-noise ratio

(PSNR), as shown in Table VIII. The PSNRs of the images sharpened by a truncation

based multiplier (i.e., TAM1-16, TAM2-13, TAM1-13, BAM-17 or BAM-18) are ﬁxed

as the adder is changed among LOA-16, CSA-8 and ETAII-8. This occurs because the

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:29

Fig. 22. The image sharpened using an accurate multiplier and an accurate adder.

lower 16 bits of the multiplication results generated by these multipliers are zeros and

hence, only the higher half of an approximate adder (as an accurate 16-bit adder for

LOA-16, CSA-8 or ETAII-8) is used.

The image sharpening algorithm is implemented in VHDL by using the selected

approximate adders and multipliers. In the implementation, no pipelining or memory

unit is used for exclusively showing the hardware characteristics of the approximate

arithmetic circuits. It is synthesized by Synopsys DC using the same process, voltage

and temperature as in the simulation of the approximate adders. The 512 ×512 pixel

values of the image shown in Table VII are used as inputs for assessing the power

dissipation using the PrimeTime-PX tool at a clock period of 10 ns.

Table IX shows the circuit characteristics of image sharpening using approximate

multipliers and adders. Using the same multiplier, the image sharpening implementa-

tions with LOA-16 and ETAII-8 show similar values for the delay, power and area (ex-

cept for AM1-13), while the implementations using CSA-8 have relatively larger values

for these metrics. Likewise, the image sharpening circuits have similar characteristics

using the same adder except that AM1-13, BAM-17 and BAM-18 based schemes show

slightly larger values.

Compared with the image sharpening circuit using accurate multipliers and adders,

the approximate designs using CSA-8 or AM1-13 achieve small improvement in terms

of delay and area because CSA and AM1 are less efﬁcient in delay and area compared

with the other approximate designs. By using LOA-16, ETAII-8, TAM2-13, BAM-17

or BAM-18, the circuit can be 23% faster and saves as much as 53% in power, 58% in

area, 64% in PDP and 62% in ADP compared to the accurate design.

5.2. Change Detection

In the application of change detection, the ratio between two corresponding pixel val-

ues is calculated by a divider [Chen et al. 2016]. The changes in two images are then

highlighted by normalizing the pixel ratios. In this section, 16/8divider designs are

used to calculate the division of two 8-bit gray-level images as shown in Figure 23(a)

and (b). To ensure a higher accuracy, the pixel values of the ﬁrst image are multiplied

by 64. As HSD and FPD are designed for ﬂoating-point division, AXDr, DAXD and

SEERAD are selected for the change detection. Among the four types of AXDr, the

triangle replacement has been shown to achieve the best results for image processing

[Chen et al. 2016]. Therefore, three designs of AXDr with the triangle replacement of

depth 8, AXDr1 (using approximate subtractor 1), AXDr2 (using approximate subtrac-

tor 2) and AXDr3 (using approximate subtractor 3), are used for the change detection.

For DAXD, 8/4and 10/5accurate dividers are utilized in DAXD8 and DAXD10, re-

spectively. Four accuracy levels are considered in SEERAD; they are referred to as

SEERAD1, SEERAD2, SEERAD3 and SEERAD4. Moreover, the accurate array di-

vider (denoted as ArrayD) is simulated for comparison.

Figure 23 shows the change detection results by the dividers and the obtained P-

SNR value is shown in the parentheses. It is clear that AXDr1, AXDr3 and SEERAD4

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:30 H. Jiang et al.

Table VII. Images Sharpened using Different Approximate Adder and Multiplier

Pairs

Approximate

design LOA-16 CSA-8 ETAII-8 CSPA-8

TAM1-16

AM1-13

TAM2-13

TAM1-13

BAM-17

BAM-18

perform very well in the application of change detection, while the results by the other

designs are of lower quality.

The characteristics of the change detection circuits are obtained by using the same

tool and process as in the simulation of approximate adders. The clock period is 6 ns,

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:31

Table VIII. PSNRs of the Sharpened Images (dB)

```````````

Multiplier

Adder LOA-16 CSA-8 ETAII-8 CSPA-8

TAM1-16 46.97 46.97 46.97 25.01

AM1-13 45.21 45.06 36.86 24.20

TAM2-13 41.87 41.87 41.87 24.32

TAM1-13 41.42 41.42 41.42 24.35

BAM-17 40.09 40.09 40.09 25.19

BAM-18 33.99 33.99 33.99 24.21

Table IX. Delay, Power, and Area of Image Sharpening using Approximate

Multipliers and Adders

Multiplier Adder Delay

(ns)

Power

(mW )

PDP

(pJ)

Area

(um2)

ADP

(um2.ns)

ArrayM CLAG 6.74 1.995 13.45 31,183.9 210,179.5

TAM1-16 LOA-16 5.36 0.9723 5.21 18,139.0 97,225.0

TAM1-16 CSA-8 7.45 1.032 7.69 23,652.1 176,208

TAM1-16 ETAII-8 5.34 0.9643 5.15 18,056.8 96,423.3

AM1-13 LOA-16 5.41 1.193 6.45 26,644.0 144,144

AM1-13 CSA-8 7.41 1.377 10.20 30,586.5 226,646

AM1-13 ETAII-8 6.40 1.369 8.76 28,214.7 180,574

TAM2-13 LOA-16 5.25 1.055 5.54 17,057.8 89,553.5

TAM2-13 CSA-8 6.43 1.053 6.77 20,526.6 131,986

TAM2-13 ETAII-8 5.22 1.041 5.43 16,975.6 88,612.6

TAM1-13 LOA-16 5.25 0.9467 4.97 17,221.0 90,410.3

TAM1-13 CSA-8 7.45 0.9942 7.41 22,734.1 169,369

TAM1-13 ETAII-8 5.34 0.9350 4.88 17,138.8 89,464.5

BAM-17 LOA-16 6.14 1.226 7.53 14,993.8 92,061.9

BAM-17 CSA-8 7.36 1.247 9.17 16,533.0 121,683

BAM-17 ETAII-8 6.13 1.211 7.42 14,868.5 91,143.9

BAM-18 LOA-16 5.97 1.097 6.55 13,285.3 79,313.2

BAM-18 CSA-8 6.89 1.117 7.70 16,901.8 116,453

BAM-18 ETAII-8 5.96 1.076 6.41 13,156.0 78,409.8

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:32 H. Jiang et al.

(a) Input image 1 (b) Input image 2 (c) Accurate output

(d) AXDr1 (40.46 dB) (e) AXDr2 (22.91 dB) (f) AXDr3 (41.71 dB)

(g) DAXD8 (25.22 dB) (h) DAXD10 (24.33 dB) (i) SEERAD1 (22.08 dB)

(j) SEERAD2 (28.15 dB) (k) SEERAD3 (33.17 dB ) (l) SEERAD4 (36.61 dB)

Fig. 23. Change detection using different approximate dividers.

and the input combinations for the power evaluation are the two images shown in

Figure 23(a) and (b) with 384 ×507 pixels. The synthesis results are shown in Table X.

To be consistent with the other designs, AXDr1, AXDr2 and AXDr3 are implemented

at the gate level rather than at the transistor level as in [Chen et al. 2016].

Table X shows that the array-based dividers (ArrayD, AXDrs and DAXD) are more

power and area efﬁcient than the rounding based approximate dividers (SEERADs).

However, they are very slow except for DAXD that uses a smaller accurate array di-

vider. SEERADs are very fast, but they consume more power and area due to the use

of look-up tables and the large output size (including the fractional part). In terms of

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:33

PDP and ADP, DAXD10 has the smallest values followed by SEERAD3. Among the

approximate dividers that produce excellent change detection results, SEERAD4 has

the smallest PDP but the largest ADP. AXDr1 results in the largest PDP and the s-

mallest ADP, while AXDr3 has both moderate PDP and ADP. To be more speciﬁc, the

change detection using AXDr3 saves 25% of power and 12% of area compared with the

accurate design. It is 40% faster with 18% reduction in PDP by using SEERAD4.

Table X. Accuracy and Circuit Characteristics of the Change Detection using

Approximate Dividers.

Divider PSNR (dB)Delay

(ns)

Power

(uW )

PDP

(fJ)

Area

(um2)

ADP

(um2.ns)

ArrayD – 4.08 54.29 221.50 425.8 1,737.2

AXDr1 40.46 3.85 50.71 195.23 415.5 1,599.7

AXDr2 22.91 4.36 54.08 235.79 408.2 1,779.6

AXDr3 41.71 4.58 40.55 185.72 376.2 1,722.9

DAXD10 24.33 2.43 40.25 97.84 375.7 912.9

SEERAD3 33.17 1.83 60.27 110.29 615.8 1,126.8

SEERAD4 36.61 2.43 70.62 181.33 765.4 1,859.9

6. CONCLUSION

In this article, designs of approximate arithmetic circuits are reviewed. Their error

and circuit characteristics are evaluated using functional simulation and hardware

synthesis with an industrial 28nm technology library.

Approximate Adders: In general, approximate speculative adders show high accura-

cy and relatively small PDPs. The approximate adders using approximate full adders

in the LSBs are slow, but they are power efﬁcient with high ERs (due to the approx-

imate LSBs), low average error and moderate NMED and MRED values (due to the

accurate MSBs). The error and circuit characteristics of the segmented and carry select

adders vary with the prediction of the carry signals.

A truncated adder has a smaller MRED (an indicator of a smaller error magnitude)

than most approximate designs at a similar PDP except for LOA and CSA. However,

it has a lower performance and a signiﬁcantly higher ER compared with the other

approximate designs. As a result, a simple truncation of the LSBs in an adder causes a

high ER and does not signiﬁcantly improve the performance of the adder, though with

a relatively small error distance.

Approximate Multipliers: For approximate multipliers, truncation of part of the par-

tial products is an effective scheme to reduce hardware complexity, while preserving a

moderate NM ED and M RED. Similarly, truncating some LSBs of the input operands

can efﬁciently reduce the hardware overhead of a multiplier and result in a moderate

MRED (an indicator of the error magnitude) that is smaller than most other approxi-

mate designs, except for TAM1, TAM2 and ICM, for a similar PDP.

Albeit with a relatively low ER, UDM shows a low accuracy in terms of the error

distance and a relatively high circuit overhead, because the 2×2approximate multi-

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:34 H. Jiang et al.

plier is used to compute the most signiﬁcant bits and accurate adders are utilized to

accumulate the generated partial products. ICM has the lowest ER among all designs.

When truncation is not used, multipliers approximated in the partial product tree tend

to have a poor accuracy (except AWTM-3 and AWTM-4) and moderate hardware con-

sumption, while multipliers using approximate counters or compressors are usually

very accurate with relatively high power dissipation and hardware consumption. The

approximate Booth multipliers show different characteristics in hardware efﬁciency

and accuracy.

Approximate Dividers: For the dividers, the approximate array dividers are slow,

but they are hardware efﬁcient with variable accuracy depending on the approxima-

tion parameters. The dividers based on curve ﬁtting are very accurate and fast but

they require a large area and high power dissipation due to the utilization of look-up

tables. The rounding based approximate dividers have a very high speed, large area

and power dissipation, with a relatively low accuracy.

Application: Image sharpening is implemented using the selected approximate mul-

tipliers and adders. The accuracy and circuit characteristics of the image sharpening

obtained by simulation indicate signiﬁcant savings in hardware, while producing sim-

ilar sharpening quality as the accurate design. On average, the designs using approxi-

mate adders and multipliers with an acceptable accuracy consumes only 55% of power,

62% of area, 51% of PDP and 57% of ADP compared to the accurate design.

Approximate dividers are utilized in the change detection of images. The simulation

and synthesis results show that the change detection circuits using the approximate

array divider (AXDr1 and AXDr3) are power and area efﬁcient but very slow, whereas

the one using the rounding based approximate divider (SEERAD4) consumes more

power and area with a high performance for an excellent detection accuracy.

ACKNOWLEDGMENT

The authors would like to thank Vincent Camus from EPFL for his contributions in

proof reading the article.

REFERENCES

Tinku Acharya and Ajoy K Ray. 2005. Image processing: principles and applications. John Wiley & Sons.

Shaahin Angizi, Zhezhi He, Ronald F. DeMara, and Deliang Fan. 2017. Composite Spintronic Accuracy-

Conﬁgurable Adder for Low Power Digital Signal Processingr. In ISQED. IEEE.

Dursun Baran, Mustafa Aktan, and Vojin G Oklobdzija. 2010. Energy efﬁcient implementation of parallel

CMOS multipliers with improved compressors. In ACM/IEEE international symposium on Low power

electronics and design. ACM, 147–152.

Kartikeya Bhardwaj, Pravin S. Mane, and Jorg Henkel. 2014. Power- and area-efﬁcient Approximate Wal-

lace Tree Multiplier for error-resilient systems. In ISQED. 263–269.

Hao Cai, You Wang, Lirida AB Naviner, Zhaohao Wang, and Weisheng Zhao. 2016. Approximate comput-

ing in MOS/spintronic non-volatile full-adder. In International Symposium on Nanoscale Architectures

(NANOARCH). IEEE, 203–208.

Vincent Camus, Jeremy Schlachter, and Christian Enz. 2015. Energy-efﬁcient inexact speculative adder

with high performance and accuracy control. In IEEE International Symposium on Circuits and Systems

(ISCAS). 45–48.

Vincent Camus, Jeremy Schlachter, and Christian Enz. 2016. A low-power carry cut-back approximate adder

with ﬁxed-point implementation and ﬂoating-point precision. In Proceedings of the 53rd Annual Design

Automation Conference. ACM, 127.

Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2015. Design of Approximate Unsigned In-

teger Non-restoring Divider for Inexact Computing. In Proceedings of the 25th edition on Great Lakes

Symposium on VLSI. ACM, 51–56.

Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2016. On the Design of Approximate Restoring

Dividers for Error-Tolerant Applications. IEEE Trans. Comput. 65, 8 (2016), 2522–2533.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:35

Linbin Chen, Paolo Montuschi, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2017. Design of Approximate

High-Radix Dividers by Inexact Binary Signed-Digit Addition. In Proceedings of the 27th IEEE/ACM

Great Lakes Symposium on VLSI.

Yuan-Ho Chen and Tsin-Yuan Chang. 2012. A high-accuracy adaptive conditional-probability estimator for

ﬁxed-width Booth multipliers. IEEE Trans. Circuits and Systems I: Regular Papers 59, 3 (2012), 594–

603.

Vinay K. Chippa, Debabrata Mohapatra, Anand Raghunathan, Kaushik Roy, and Srimat T. Chakradhar.

2010. Scalable effort hardware design: exploiting algorithmic resilience for energy efﬁciency. In DAC.

ACM Press, New York, New York, USA, 555–560.

Kyung-Ju Cho, Kwang-Chul Lee, Jin-Gyun Chung, and Keshab K Parhi. 2004. Design of low-error ﬁxed-

width modiﬁed booth multiplier. IEEE Transactions on VLSI Systems 12, 5 (2004), 522–531.

S.R. Datla, M.A Thornton, and D.W. Matula. 2009. A Low Power High Performance Radix-4 Approximate

Squaring Circuit. In ASAP. 91–97.

Kai Du, P. Varman, and K. Mohanram. 2012. High performance reliable variable latency carry select addi-

tion. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012. 1257–1262.

Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture support for disci-

plined approximate programming. In ACM SIGPLAN Notices, Vol. 47. ACM, 301–312.

Farzad Farshchi, Muhammad Saeed Abrishami, and Sied Mehdi Fakhraie. 2013. New approximate multi-

plier for low power digital signal processing. In CADS. IEEE, 25–30.

Michael J Flynn. 1970. On division by functional iteration. IEEE Trans. Comput. 100, 8 (1970), 702–706.

V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. 2013. Low-Power Digital Signal Processing Using

Approximate Adders. IEEE Trans. CAD 32, 1 (January 2013), 124–137.

Jie Han. 2016. Introduction to Approximate Computing. In 2016 IEEE 34th VLSI Test Symposium (VTS).

IEEE, 1–1.

Jie Han and Michael Orshansky. 2013. Approximate computing: An emerging paradigm for energy-efﬁcient

design. In Test Symposium (ETS), 2013 18th IEEE European. IEEE, 1–6.

Soheil Hashemi, R Bahar, and Sherief Reda. 2015. Drum: A dynamic range unbiased multiplier for ap-

proximate applications. In Proceedings of the IEEE/ACM International Conference on Computer-Aided

Design. IEEE Press, 418–425.

Soheil Hashemi, R Bahar, and Sherief Reda. 2016. A low-power dynamic divider for approximate applica-

tions. In Proceedings of the 53rd Annual Design Automation Conference. ACM, 105.

Junjun Hu and Weikang Qian. 2015. A New Approximate Adder with Low Relative Error and Correct Sign

Calculation. In DATE. 1449–1454.

Jiawei Huang, John Lach, and Gabriel Robins. 2012. A methodology for energy-quality tradeoff using im-

precise hardware. In Proceedings of the 49th ACM Annual Design Automation Conference. 504–509.

Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2015. A Comparative Review and Evaluation of Approxi-

mate Adders. In Proceedings of ACM Great Lakes Symposium on VLSI. 343–348.

Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2016a. Approximate Radix-8 Booth Multiplier for Low-

Power and High-Performance Operation. IEEE Trans. Comput. 65, 8 (2016), 2638–2644.

Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2016b. A Comparative Evaluation of Approximate Multi-

pliers. In IEEE/ACM International Symposium on Nanoscale Architectures.

Andrew B Kahng and Seokhyeong Kang. 2012. Accuracy-conﬁgurable adder for approximate arithmetic

designs. In Proceedings of the 49th ACM Annual Design Automation Conference. 820–825.

D Kelly, B Phillips, and S Al-Sarawi. 2009. Approximate signed binary integer multipliers for arithmetic

data value speculation. In Conference on Design & Architectures For Signal And Image Processing.

Sophia Antipolis, France.

Yongtae Kim, Yong Zhang, and Peng Li. 2013. An Energy Efﬁcient Approximate Adder with Carry Skip for

Error Resilient Neuromorphic VLSI Systems. In ICCAD. 130–137.

Parag Kulkarni, Puneet Gupta, and Milos Ercegovac. 2011. Trading accuracy for power with an underde-

signed multiplier architecture. In International Conference on VLSI Design. 346–351.

Khaing Yin Kyaw, Wang Ling Goh, and Kiat Seng Yeo. 2010. Low-power high-speed multiplier for error-

tolerant application. In EDSSC. 1–4.

Mark SK Lau, Keck-Voon Ling, and Yun-Chung Chu. 2009. Energy-aware probabilistic multiplier: design

and analysis. In CASES. 281–290.

Li Li and Hai Zhou. 2014. On Error Modeling and Analysis of Approximate Adders. In ICCAD. 511–518.

Jinghang Liang, Jie Han, and F. Lombardi. 2013. New Metrics for the Reliability of Approximate and Prob-

abilistic Adders. IEEE Trans. Comput. 62, 9 (September 2013), 1760–1771.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

60:36 H. Jiang et al.

Chia-Hao Lin and Ing-Chao Lin. 2013. High accuracy approximate multiplier with error correction. In ICCD.

IEEE, 33–38.

IngChao Lin, YiMing Yang, and ChengChian Lin. 2015. High-Performance Low-Power Carry Speculative

Addition With Varible Latency. IEEE Trans. VLSI Syst. 23, 9 (2015), 1591–1603.

Cong Liu. 2014. Design and Analysis of Approximate Adders and Multipliers. Master’s thesis. University of

Alberta, Canada.

Cong Liu, Jie Han, and Fabrizio Lombardi. 2014. A Low-Power, High-Performance Approximate Multiplier

with Conﬁgurable Partial Error Recovery. In DATE.

Cong Liu, Jie Han, and Fabrizio Lombardi. 2015. An analytical framework for evaluating the error charac-

teristics of approximate adders. IEEE Trans. Comput. 64, 5 (2015), 1268–1281.

Cong Liu, Honglan Jiang, Fabrizio Lombardi, and Jie Han. 2017a. High-Performance Approximate Un-

signed Multipliers with Conﬁgurable Error Recovery. IEEE Transactions on Circuits and Systems I,

under revision. (2017).

Wei Liu and Alberto Nannarelli. 2012. Power efﬁcient division and square root unit. IEEE Trans. Comput.

61, 8 (2012), 1059–1070.

Weiqiang Liu, Liangyu Qian, Chenghua Wang, Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2017b.

Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing. IEEE Trans. Comput.

(2017).

Joshua Yung Lih Low and Ching Chuen Jong. 2013. Non-iterative high speed division computation based

on Mitchell logarithmic method. In IEEE International Symposium on Circuits and Systems (ISCAS).

2219–2222.

Shih-Lien Lu. 2004. Speeding up processing with approximation circuits. Computer 37, 3 (March 2004),

67–73.

Jieming Ma, Ka Lok Man, Nan Zhang, Sheng-Uei Guan, and Taikyeong Ted Jeong. 2013. High-speed area-

efﬁcient and power-aware multiplier design using approximate compressors along with bottom-up tree

topology. In ICMV: Algorithms, Pattern Recognition, and Basic Technologies.

H R Mahdiani, A Ahmadi, S M Fakhraie, and C Lucas. 2010. Bio-Inspired Imprecise Computational Blocks

for Efﬁcient VLSI Implementation of Soft-Computing Applications. IEEE Trans. Circuits and Systems

57, 4 (April 2010), 850–862.

Sana Mazahir, Osman Hasan, Rehan Haﬁz, Muhammad Shaﬁque, and Jörg Henkel. 2017. Probabilistic

error modeling for approximate adders. IEEE Trans. Comput. 66, 3 (2017), 515–530.

Jin Miao, Ku He, Andreas Gerstlauer, and Michael Orshansky. 2012. Modeling and synthesis of quality-

energy optimal approximate adders. In Proceedings of the ACM International Conference on Computer-

Aided Design. 728–735.

Joshua San Miguel, Jorge Albericio, Andreas Moshovos, and Natalie Enright Jerger. 2015. Doppelgänger: a

cache for approximate computing. In Proceedings of the 48th International Symposium on Microarchi-

tecture. ACM, 50–61.

SONG Min-An, VAN Lan-Da, and KUO Sy-Yen. 2007. Adaptive low-error ﬁxed-width Booth multipliers.

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 90, 6

(2007), 1180–1187.

John N Mitchell Jr. 1962. Computer multiplication and division using binary logarithms. IRE Transactions

on Electronic Computers 4 (1962), 512–517.

D. Mohapatra, V.K. Chippa, A Raghunathan, and K. Roy. 2011. Design of voltage-scalable meta-functions

for approximate computing. In Design, Automation & Test in Europe Conference & Exhibition (DATE).

1–6.

Amir Momeni, Jie Han, Paolo Montuschi, and Fabrizio Lombardi. 2015. Design and Analysis of Approximate

Compressors for Multiplication. IEEE Trans. Comput. 64, 4 (2015), 984–994.

Vojtech Mrazek, Syed Shakib Sarwar, Lukas Sekanina, Zdenek Vasicek, and Kaushik Roy. 2016. Design

of power-efﬁcient approximate multipliers for approximate artiﬁcial neural networks. In International

Conference On Computer Aided Design (ICCAD). 7.

Srinivasan Narayanamoorthy, Hadi Asghari Moghaddam, Zhenhong Liu, Taejoon Park, and Nam Sung

Kim. 2015. Energy-efﬁcient approximate multiplication for digital signal processing and classiﬁcation

applications. IEEE Transactions on VLSI Systems 23, 6 (2015), 1180–1184.

Vojin G. Oklobdzija, David Villeger, and Simon S. Liu. 1996. A method for speed optimized partial prod-

uct reduction and generation of fast parallel multipliers using an algorithmic approach. IEEE Trans.

Comput. 45, 3 (1996), 294–306.

Behrooz Parhami. 2000. Computer arithmetic. Oxford university press.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

A Review, Classiﬁcation and Comparative Evaluation of Approximate Arithmetic Circuits 60:37

Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman.

2011. EnerJ: Approximate data types for safe and general low-power computation. 46, 6 (June 2011),

164–174.

Doochul Shin. 2010. Approximate logic synthesis for error tolerant applications. In DATE. IEEE, 957–960.

Zdenek Vasicek and Lukas Sekanina. 2015. Evolutionary approach to approximate digital circuits design.

IEEE Transactions on Evolutionary Computation 19, 3 (2015), 432–444.

Rangharajan Venkatesan, Amit Agarwal, Kaushik Roy, and Anand Raghunathan. 2010. MACACO: Model-

ing and analysis of circuits for approximate computing. In ICCAD. 667–673.

Ajay K Verma, Philip Brisk, and Paolo Ienne. 2008. Variable latency speculative addition: A new paradigm

for arithmetic circuit design. In DATE. 1250–1255.

Jiun-Ping Wang, Shiann-Rong Kuang, and Shish-Chang Liang. 2011. High-accuracy ﬁxed-width modiﬁed

Booth multipliers for lossy applications. IEEE Transactions on VLSI Systems 19, 1 (2011), 52–60.

Lei Wu and Ching Chuen Jong. 2015. A curve ﬁtting approach for non-iterative divider design with ac-

curacy and performance trade-off. In IEEE 13th International New Circuits and Systems Conference

(NEWCAS). 1–4.

Xinghua Yang, Yue Xing, Fei Qiao, Qi Wei, and Huazhong Yang. 2016. Approximate Adder with Hybrid

Prediction and Error Compensation Technique. In IEEE Computer Society Annual Symposium on VLSI

(ISVLSI). IEEE, 373–378.

Zhixi Yang, Ajaypat Jain, Jinghang Liang, Jie Han, and Fabrizio Lombardi. 2013. Approximate XOR/XNOR-

based Adders for Inexact Computing. In IEEE-NANO. 690–693.

Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On Reconﬁguration-Oriented Approx-

imate Adder Design and Its Application. In ICCAD. 48–54.

Reza Zendegani, Mehdi Kamal, Milad Bahadori, Ali Afzali-Kusha, and Massoud Pedram. 2017. RoBA mul-

tiplier: A rounding-based approximate multiplier for high-speed yet energy-efﬁcient digital signal pro-

cessing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 2 (2017), 393–401.

Reza Zendegani, Mehdi Kamal, Arash Fayyazi, Ali Afzali-Kusha, Saeed Safari, and Massoud Pedram. 2016.

SEERAD: A high speed yet energy-efﬁcient rounding-based approximate divider. In Design, Automation

& Test in Europe Conference & Exhibition (DATE). IEEE, 1481–1484.

Ning Zhu, Wang Ling Goh, and Kiat Seng Yeo. 2009. An enhanced low-power high-speed adder for error-

tolerant application. In ISIC 2009. 69–72.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 13, No. 4, Article 60, Pub. date: July 2017.

DAEM: A Data- and Application-Aware Error Analysis Methodology for Approximate Adders

Article

Full-text available

Oct 2023

Approximate adders are some of the fundamental arithmetic operators that are being employed in error-resilient applications, to achieve performance/energy/area gains. This improvement usually comes at the cost of some accuracy and, therefore, requires prior error analysis, to select an approximate adder variant that provides acceptable accuracy. Most of the state-of-the-art error analysis techniques for approximate adders assume input bits and operands to be independent of one another, while some also assume the operands to be uniformly distributed. In this paper, we analyze the impact of these assumptions on the accuracy of error estimation techniques, and we highlight the need to address these assumptions, to achieve better and more realistic quality estimates. Based on our analysis, we propose DAEM, a data- and application-aware error analysis methodology for approximate adders. Unlike existing error analysis models, we neither assume the adder operands to be uniformly distributed nor assume them to be independent. Specifically, we use 2D joint input probability mass functions (PMFs), populated using sample data, in order to incorporate the data and application knowledge in the analysis. These 2D joint input PMFs, along with 2D error maps of approximate adders, are used to estimate the error PMF of an adder network. The error PMF is then utilized to compute different error measures, such as the mean squared error (MSE) and mean error distance (MED). We evaluate the proposed error analysis methodology on audio and video processing applications, and we demonstrate that our methodology provides error estimates having a better correlation with the simulation results, as compared to the state-of-the-art techniques.

Design of Pass Transistor-Based Low-Power Approximate Adders for DSP Application

Conference Paper

Dec 2023

This paper presents a low-power and energy-efficient pass transistor-based full adder, along with two other approximate adders sharing the same constructional pattern. Initially, three conventional full adders are analyzed, namely the mirror adder, XOR, and XNOR-based adder circuits. Based on these circuits, a new energy-efficient 12T full adder is designed that consumes much less power compared to conventional full adder circuits. The design is implemented in 45nm technology with a +0.8 V supply voltage. Schematic simulations in Cadence Virtuoso tools showed a 60%-85% improvement in power consumption compared to previously stated conventional adders. Furthermore, the research extends to the development of two innovative approximate adder designs that demonstrate remarkable improvements in power efficiency while maintaining an acceptable level of accuracy. Finally, a comparative analysis of conventional approximate full adders proved our proposed approximate adders outperformed existing ones in terms of power consumption, power delay product, and bit error rate.

On the Commutative Operation of Approximate CMOS Ripple Carry Adders (RCAs)

Article

Jan 2023

Approximate cells can be used to design Ripple Carry Adders (RCAs) for realizing approximate addition in energy-efficient CMOS digital circuits. As inputs of approximate cells could be non-commutative in nature, approximate adders may show different output values under a commutative operation, and this may have a significant effect on the generated sum. This paper presents a detailed analysis of the commutative addition in RCAs made of different approximate cells. Initially, the impact of a non-commutative addition (NCA) to RCAs by approximate cells is assessed by exhaustive simulation at adder level. The results show that at most 17% of additions executed using AFA3 suffer from the non-commutative property, while the values for other adder cells can reach 75%∼99%. Then, an extensive analysis using images from a publicly available library is performed by comparing three-image additions with two-image additions. As a further evaluation, the adders are assessed in an image denoising application. As expected, the effect of NCA is especially pronounced for some non-commutative adders, such as AA2 and AMA4. NCA is also cumulative with the number of approximate additions, thereby causing a significant variation in the output image quality. In terms of metrics, the largest average difference in mean error distance (DMED) for three-image addition is 5.3 times higher than for two-image addition. Rankings of the non-commutative approximate adders show that AMA3 and AFA1 based adders are the best schemes with respect to commutative addition; they both also show good performance in image denoising.

Efficient Design of Modified Wallace Tree Approximate Multipliers Based on Imprecise Compressors for Error-Tolerance Applications

Article

Oct 2023

Several Approximate Multipliers (AMs) are developed by integrating approximate computing techniques into conventional exact multiplier architectures. Notably, AMs utilizing 4:2 Imprecise Compressors (IMCs) have garnered substantial attention within the scientific community. Hence, this paper proposes two novel 4:2 IMCs, denoted as D1 and D2 IMCs, as a foundational step. Subsequently, two modified Wallace tree-based AM architectures, ARCH1_D1 and ARCH2_D2, are proposed, incorporating the newly introduced IMCs. To comprehensively evaluate the approach, all existing and proposed AMs are considered input bit lengths spanning from 8 to 16 bits. These simulations are executed using both Vivado and MATLAB, and to further validate the findings, the AM designs are synthesized using the Cadence RTL compiler. The simulation results demonstrate that the proposed AMs exhibit significantly improved accuracy metrics compared to existing methods. Notably, our suggested AMs achieve average enhancements of 70.6% in accuracy, 60.4% in Energy-Delay-Product, 30.9% in Power-Delay-Product, and 41.6% in delay, all surpassing the performance of their existing counterparts. Furthermore, the assessment is extended to evaluate the performance of the proposed multipliers within the context of the Similarity Index Metric (SSIM) and Peak Signal-to-Noise Ratio (PSNR). This evaluation is done with the proposed multipliers integrated into an Image Smoothening Filter (ISF) framework, catering to applications demanding robust error tolerance. The results reveal that the proposed multipliers consistently yield higher SSIM and PSNR metrics than the existing AMs when incorporated into the ISF architecture.

Multiplexer Based Error Efficient Fixed-Width Adder Tree Design for Signal Processing Applications

Article

Full-text available

Sep 2023

In copious mixed media applications, humans cannot necessarily discern error free or erroneous outputs, owing to the small range of perception abilities. Crucial information can still be obtained from marginally inexact outputs. Leveraging this, many algorithms such as Digital Signal Processing (DSP), Discrete Cosine Transform (DCT), Motion Compensation (MC) use approximate calculations while still maintaining appreciable computation accuracy. When data processing algorithms are taken into consideration, adders play an important role in the arithmetic module by managing the power and area utilization of the system. A fixed-width adder tree design for approximate calculations is proposed, that uses the trade-off between area and accuracy as the base for analysis. Our design uses 12.42%, 18.17% and 5.05% lesser area compared to the full width adder tree, FX-AT-PT and FX-AT-DT respectively. Additionally, when compared to FX-AT-DT, TFX-AT and ITFX-AT, the proposed design has an improved Maximum Error Distance (MED) of (33.33%, 29.03%), (63.64%, 65.63%) and (38.46%, 35.29%) for N = 8, 16 respectively. To cope with the inaccuracy caused by truncation, the proposed design employs mux-based radix-4 addition coupled with bias estimation. Further, to examine the error performance we have incorporated the proposed design and a few other existing designs into the Walsh-Hadamard Transform (WHT), to process images with different metrics and compare the Peak Signal to Noise Ratio (PSNR) of the images. It was observed that the proposed design showed significant improvement in the PSNR score when compared to ITFX-AT and maintains a score similar to that of FX-AT-PT.

Design and Implementation of High-Performance Multi-Level Approximate Multiplier

Conference Paper

Mar 2024

Design Wireless Communication Circuits and Systems Using Approximate Computing

Chapter

Jan 2024

In recent years, the communication circuits and systems have become more complicated and more power hungry. However, due to the channel noise and forward error correction (FEC) module, communication systems can inherently tolerate certain errors at the receiving end. Approximate computing schemes have been explored in many communication systems to achieve low complexity and low power design. This chapter introduces approximate floating point adder design with mantissa truncation and compensation method, low-complexity floating-point multiplier design with partial product location adjustment and method based on partial product probability compensation. A top-down FP FFT processor design strategy for determined accuracy requirement is proposed with several times mantissa bit-width search iteration, which achieves high efficiency. At last, two approximate polar code decoders are introduced with approximate PE units or approximate sorter design, and the hardware efficiency increased significantly with negligible BER performance deterioration.

Design and Analysis of High Performance Heterogeneous Block-based Approximate Adders

Article

Sep 2023

Approximate computing is an emerging paradigm to improve the power and performance efficiency of error-resilient applications. As adders are one of the key components in almost all processing systems, a significant amount of research has been carried out towards designing approximate adders that can offer better efficiency than conventional designs, however, at the cost of some accuracy loss. In this paper, we highlight a new class of energy-efficient approximate adders, namely Heterogeneous Block-based Approximate Adders (HBAA), and propose a generic configurable adder model that can be configured to represent a particular HBAA configuration. An HBAA, in general, is composed of heterogeneous sub-adder blocks of equal length, where each sub-adder can be an approximate sub-adder and have a different configuration. The sub-adders are mainly approximated through inexact logic and carry truncation. Compared to the existing design space, HBAAs provide additional design points that fall on the Pareto-front and offer a better quality-efficiency trade-off in certain scenarios. Furthermore, to enable efficient design space exploration based on user-defined constraints, we propose an analytical model to efficiently evaluate the Probability Mass Function (PMF) of approximation error and other error metrics, such as Mean Error Distance (MED), Normalized Mean Error Distance (NMED) and Error Rate (ER) of HBAAs. The results show that HBAA configurations can provide around $15\% $ reduction in area and up to $17\% $ reduction in energy compared to state-of-the-art approximate adders.

Evaluation of Imprecise Subtractors into Test Zone Search for VVC Encoding

Conference Paper

Aug 2023

DIMCA: An Area-Efficient Digital In-Memory Computing Macro Featuring Approximate Arithmetic Hardware in 28 nm

Article

Jan 2023

Recent SRAM-based in-memory computing (IMC) hardware demonstrates high energy efficiency and throughput for matrix–vector multiplication (MVM), the dominant kernel for deep neural networks (DNNs). Earlier IMC macros have employed analog-mixed-signal (AMS) arithmetic hardware. However, those so-called AIMCs suffer from process, voltage, and temperature (PVT) variations. Digital IMC (DIMC) macros, on the other hand, exhibit better robustness against PVT variations, but they tend to require more silicon area. This article proposes novel DIMC hardware featuring approximate arithmetic () to improve area efficiency without hurting compute density (CD). We also propose an approximation-aware training model and a customized number format to compensate for the accuracy degradation caused by the approximation hardware. We prototyped the test chip in 28-nm CMOS. It contains two versions: the with single-approximate hardware () and with double-approximate hardware (). The measurement results show that supports a 4 b-activation and 1 b-weight (4 b/1 b) CNN model, achieving 327 kb/mm $^2$ , 458–990 TOPS/W (normalized to 1 b/1 b), 8.27–392 TOPS/mm $^2$ (normalized to 1 b/1 b), and 90.41% accuracy for CIFAR-10. supports a 1 b/1 b CNN model, achieving 485 kb/mm $^2$ , 932–2219 TOPS/W, 14.4–607 TOPS/mm $^2$ , and 86.96% accuracy for CIFAR-10.

Design of Power-Efficient Approximate Multipliers for Approximate Artificial Neural Networks

Conference Paper

Full-text available

Nov 2016

Artificial neural networks (NN) have shown a significant promise in difficult tasks like image classification or speech recognition. Even well-optimized hardware implementations of digital NNs show significant power consumption. It is mainly due to non-uniform pipeline structures and inherent redundancy of numerous arithmetic operations that have to be performed to produce each single output vector. This paper provides a methodology for the design of well-optimized power-efficient NNs with a uniform structure suitable for hardware implementation. An error resilience analysis was performed in order to determine key constraints for the design of approximate multipliers that are employed in the resulting structure of NN. By means of a search based approximation method, approximate multipliers showing desired tradeoffs between the accuracy and implementation cost were created. Resulting approximate NNs, containing the approximate multipliers, were evaluated using standard benchmarks (MNIST dataset) and a real-world classification problem of Street-View House Numbers. Significant improvement in power efficiency was obtained in both cases with respect to regular NNs. In some cases, 91% power reduction of multiplication led to classification accuracy degradation of less than 2.80%. Moreover, the paper showed the capability of the back propagation learning algorithm to adapt with NNs containing the approximate multipliers.

Image Processing: Principles and Applications

Book

Sep 2005

Design of Approximate High-Radix Dividers by Inexact Binary Signed-Digit Addition

Conference Paper

May 2017

Approximate high radix dividers (HR-AXDs) are proposed and investigated in this paper. High-radix division is reviewed and inexact computing is introduced at different levels. Design parameters such as number of bits (N) and radix (r) are considered in the analysis; the replacement schemes with inexact cells and truncation schemes of exact cells in the binary signed-digit adder array is introduced. Circuit-level performance and the error characteristics of the inexact high radix dividers are analyzed for the proposed designs. The combined assessment of the normal error distance, power dissipation and delay is investigated and applications of approximate high-radix dividers are treated in detail. The simulation results show that the proposed approximate dividers offer extensive saving in terms of power dissipation, circuit complexity and delay, while only incurring in a small degradation in accuracy thus making them possibly suitable and interesting to some applications and domains such as low power/mobile computing.

Composite spintronic accuracy-configurable adder for low power Digital Signal Processing

Conference Paper

Mar 2017

EnerJ: approximate data types for safe and general low-power computation

Article

Aug 2012

Energy is increasingly a first-order concern in computer systems. Exploiting energy-accuracy trade-offs is an attractive choice in applications that can tolerate inaccuracies. Recent work has explored exposing this trade-off in programming models. A key challenge, though, is how to isolate parts of the program that must be precise from those that can be approximated so that a program functions correctly even as quality of service degrades. We propose using type qualifiers to declare data that may be subject to approximate computation. Using these types, the system automatically maps approximate variables to low-power storage, uses low-power operations, and even applies more energy-efficient algorithms provided by the programmer. In addition, the system can statically guarantee isolation of the precise program component from the approximate component. This allows a programmer to control explicitly how information flows from approximate data to precise data. Importantly, employing static analysis eliminates the need for dynamic checks, further improving energy savings. As a proof of concept, we develop EnerJ, an extension to Java that adds approximate data types. We also propose a hardware architecture that offers explicit approximate storage and computation. We port several applications to EnerJ and show that our extensions are expressive and effective; a small number of annotations lead to significant potential energy savings (10%-50%) at very little accuracy cost.

Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing

Article

Feb 2017

Approximate computing is an attractive design methodology to achieve low power, high performance (low delay) and reduced circuit complexity by relaxing the requirement of accuracy. In this paper, approximate Booth multipliers are designed based on approximate radix-4 modified Booth encoding (MBE) algorithms and a regular partial product array that employs an approximate Wallace tree. Two approximate Booth encoders are proposed and analyzed for error-tolerant computing. The error characteristics are analyzed with respect to the so-called approximation factor that is related to the inexact bit width of the Booth multipliers. Simulation results at 45 nm feature size in CMOS for delay, area and power consumption are also provided. The results show that the proposed 16-bit approximate radix-4 Booth multipliers with approximate factors of 12 and 14 are more accurate than existing approximate Booth multipliers with moderate power consumption. The proposed R4ABM2 multiplier with an approximation factor of 14 is the most efficient design when considering both power-delay product and the error metric NMED. Case studies for image processing show the validity of the proposed approximate radix-4 Booth multipliers.

Approximate Adder with Hybrid Prediction and Error Compensation Technique

Conference Paper

Jul 2016

Probabilistic Error Modeling for Approximate Adders

Article

Sep 2016

Approximate adders are widely being advocated as a means to achieve performance gain in error resilient applications. In this paper, a generic methodology for analytical modeling of probability of occurrence of error and the Probability Mass Function (PMF) of error value in a selected class of approximate adders is presented, which can serve as performance metrics for the comparative analysis of various adders and their configurations. The proposed model is applicable to approximate adders that comprise of sub-adder units of uniform as well as non-uniform lengths. Using a systematic methodology, we derive closed form expressions for the probability of error for a number of state-of-the-art high-performance approximate adders. The probabilistic analysis is carried out for arbitrary input distributions. It can be used to study the dependence of error statistics in an adder's output on its configuration and input distribution. Moreover, it is shown that by building upon the proposed error model, we can estimate the probability of error in circuits with multiple approximate adders. We also demonstrate that, using the proposed analysis, the comparative performance of different approximate adders can be correctly predicted in practical applications of image processing.

RoBA Multiplier: A Rounding-Based Approximate Multiplier for High-Speed yet Energy-Efficient Digital Signal Processing

Article

Jul 2016

In this paper, we propose an approximate multiplier that is high speed yet energy efficient. The approach is to round the operands to the nearest exponent of two. This way the computational intensive part of the multiplication is omitted improving speed and energy consumption at the price of a small error. The proposed approach is applicable to both signed and unsigned multiplications. We propose three hardware implementations of the approximate multiplier that includes one for the unsigned and two for the signed operations. The efficiency of the proposed multiplier is evaluated by comparing its performance with those of some approximate and accurate multipliers using different design parameters. In addition, the efficacy of the proposed approximate multiplier is studied in two image processing applications, i.e., image sharpening and smoothing.

Approximate Computing in MOS/Spintronic Non-Volatile Full-Adder

Conference Paper

Jul 2016

Approximate computing and its related topics have shown the potential in next generation computing systems. In this paper, new circuit level design for approximate computing is proposed based on non-volatile (NV) logic-in-memory structure. Two types of NV approximate adders are implemented with circuit reconfiguration and insufficient writing current. Spin torque transfer magnetic tunnel junction (STT-MTJ) is used as NV memory element in magnetic full adder (MFA). The proposed approximate MFAs are implemented with 28nm ultra thin body and buried oxide (UTBB) fully depleted silicon-on-insulator (FD-SOI) technology. Simulation results are presented including power consumption, circuit latency, leakage power, error distance and reliability performance. Low V dd design strategies are discussed as well.

A Comparative Review and Evaluation of Approximate Adders

Abstract and Figures

Recommended publications

Approximate Adders for digital signal processing (DSP) applications – A relative study

A Review, Classification, and Comparative Evaluation of Approximate Arithmetic Circuits

Approximate Arithmetic Circuits: Design and Evaluation: Methodologies and CAD

Characterizing Approximate Adders and Multipliers Optimized under Different Design Constraints