Optimal Final Carry Propagate Adder Design for Parallel Multipliers

October 2011

Source
arXiv

Authors:

VIT University

Based on the ASIC layout level simulation of 7 types of adder structures each of four different sizes, i.e. a total of 28 adders, we propose expressions for the width of each of the three regions of the final Carry Propagate Adder (CPA) to be used in parallel multipliers. We also propose the types of adders to be used in each region that would lead to the optimal performance of the hybrid final adders in parallel multipliers. This work evaluates the complete performance of the analyzed designs in terms of delay, area, power through custom design and layout in 0.18 um CMOS process technology.

Ramkumar B and Harish M Kittur

Index terms – ASIC (Application Specific Integrated Circuit), optimal hybrid CPA, parallel multiplier, low power, area efficient.

I. INTRODUCTION

The critical signal path in a parallel multiplier passes through three domains: the AND gate array, the PPST (Partial Product Summation Tree), and the final CPA. The delay introduced by the AND gate array is small compared to that of the other two components, especially for large multipliers, and is relatively independent of the multiplier size. The PPST and the final CPA therefore account for the dominant share of the multiplier delay [1].

Hybrid CPAs have been proposed earlier, with detailed investigations of the final addition in parallel multipliers [1]-[3]. It is well known that the signals applied to the inputs of the CPA arrive first at the two ends of the CPA and last in the middle. Determining the exact arrival times at the final adder is therefore of prime importance in the design of the optimal final adder. We have analyzed the arrival times from the PPST through layout implementation and, based on those arrival times, applied the inputs to 7 types of adders at 4 different bit sizes (a total of 28 adders). The analysis uses industry standard tools, and the optimal final adder structure is designed from the post-layout simulation results. The investigation covers 8 by 8, 16 by 16, 32 by 32 and 64 by 64 Dadda multipliers, with the final adders being 16, 32, 64 and 128-bit versions of the Ripple Carry Adder (RCA), Carry Save Adder (CSA), Carry Select Adder (CSLA), Carry Look Ahead Adder (CLA), and three BEC (Binary to Excess-1) based adders, referred to here as the BEC Carry Select Adder (BCSLA), BEC Carry Save Adder (BCSA) and BEC Carry Look Ahead Adder (BCLA) [4]-[10].

This paper is structured as follows. Section II deals with the design of the PPST based on the Dadda algorithm and the analysis of the signal arrival profile from the PPST. Section III analyzes the performance of the various adders in terms of area, delay and power. The equations for efficient partitioning of the multiplier regions are developed in Section IV. The final adder design and the ASIC implementation details are provided in Sections V and VI respectively. Finally, the work is concluded in Section VII.

II. ANALYSIS OF PPST SIGNAL ARRIVAL PROFILE

The basic top-level implementation of an N by N unsigned parallel multiplier without the CPA is shown in Fig. 1. The CPA is left out so that the exact arrival times from the PPST can be analyzed.

Signal Buffering

In order to determine the typical signal arrival profile and drive strengths, D flip-flops are used on the primary inputs and outputs. The D flip-flops drive multiple buffers that distribute the input signals to the $N^2$ AND gates. Delay simulations were performed for each cell library to determine:

a) The maximum number of buffers that a single D flip-flop can drive.

b) The maximum number of AND gate inputs that a single buffer can drive.

[Fig. 1 diagram: D flip-flops on the N-bit multiplicand (A) and multiplier (B) inputs drive buffers, the AND gate array and the partial product reduction stage, whose outputs P feed D flip-flops with load capacitors; all flip-flops are clocked by CLK.]

Fig. 1. Top-level implementation of N by N multiplier without CPA

This work was supported in part by the Integrated Circuit Design Laboratories, VIT University, Vellore, India. B. Ramkumar (ramkumar.b@vit.ac.in) and Harish Kittur (kittur@vit.ac.in) are with the VLSI division, School of Electronics Engineering, VIT University, Vellore, India.

Partial Product Reduction Tree

The Wallace and Dadda methods are the popular partial product reduction algorithms for fast multipliers [11]-[12]. A closer examination of the delay and area of these two multipliers shows that the Dadda multiplier is slightly faster and more area efficient than the Wallace multiplier [13]-[14].

For the reduction of the N by N partial product matrix, Dadda proposes a sequence of matrix heights that is determined by working back from the final two-row matrix. To keep the number of reduction stages to the minimum, the height of each intermediate matrix is limited to the largest integer that is no more than 1.5 times the height of its successor. In this paper, the PPST is implemented based on the Dadda algorithm.
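The height rule described above can be sketched in a few lines of Python. This is only an illustration of the sequence of target heights (2, 3, 4, 6, 9, 13, ...); the function name `dadda_heights` is a hypothetical helper, not something taken from the paper's implementation.

```python
def dadda_heights(n):
    """Target matrix heights for reducing an n-row partial product matrix,
    listed from the first reduction stage down to the final two-row matrix."""
    heights = []
    d = 2  # the final matrix always has two rows
    while d < n:
        heights.append(d)
        d = (3 * d) // 2  # largest integer no more than 1.5 times d
    return heights[::-1]


# An 8 by 8 multiplier starts with 8 rows and needs four reduction stages.
stages_8 = dadda_heights(8)  # [6, 4, 3, 2]
```

Under this rule the 16, 32 and 64-row matrices need 6, 8 and 10 reduction stages respectively.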

III. FINAL ADDER ANALYSIS

Once the partial product matrix has been reduced to a height of two, the length of the final stage CPA in the Dadda approach is

CPA length = 2N - 2

Fig. 1 details the connection of an AND gate array (p0) and the compression stages (p1 to ps) to D flip-flops with capacitive loads, which gives the arrival times from the inputs to the final CPA. The evaluated arrival profile is then applied as the input delay of the CPA through a timing constraint file, as shown in Fig. 2.
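As a rough illustration of how an arrival profile could be turned into a timing constraint file, the sketch below emits one SDC-style `set_input_delay` line per adder input bit. The paper does not list the contents of its constraint file, so the port names `A` and `B`, the clock name and the delay values used here are assumptions made purely for the example.

```python
def input_delay_constraints(arrival_ns, clock="CLK", ports=("A", "B")):
    """Build SDC-style input delay lines, one per adder input bit.

    arrival_ns[i] is the arrival time (in ns) of bit position i.
    Port and clock names are hypothetical placeholders.
    """
    lines = []
    for bit, delay in enumerate(arrival_ns):
        for port in ports:
            lines.append(
                f"set_input_delay -clock {clock} {delay:.2f} "
                f"[get_ports {{{port}[{bit}]}}]"
            )
    return lines


# Two bit positions with made-up arrival times of 0.50 ns and 0.80 ns.
example = input_delay_constraints([0.50, 0.80])
```

Both operands of a given bit position receive the same delay, since the two rows leaving the PPST at that position are treated as arriving together.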

A. 8 by 8 Dadda multiplier

The input arrival profile to the final CPA of the 8 by 8 Dadda multiplier is shown in Fig. 3(a). Based on the arrival profile, the CPA is divided into 3 regions: the first region has a positive slope from point 0 to 5, the second region is constant from point 6 to 14, and the third region, with a negative slope, consists of point 15 only. Since the negative slope region has only one point, we include it in the second region.
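One way to locate the three regions automatically from a measured arrival profile is sketched below. It treats every bit whose arrival time is within a small tolerance of the latest arrival as part of the constant region. The tolerance and the sample profile are assumptions for illustration; the profile values are invented and only mimic the shape described for the 8 by 8 case.

```python
def split_regions(arrival_ns, tol=0.05):
    """Split an arrival profile into (rising, constant, falling) regions.

    Each region is an inclusive (first_bit, last_bit) pair; a region is
    empty when first_bit > last_bit.
    """
    peak = max(arrival_ns)
    flat = [i for i, t in enumerate(arrival_ns) if t >= peak - tol]
    first_flat, last_flat = flat[0], flat[-1]
    rising = (0, first_flat - 1)
    constant = (first_flat, last_flat)
    falling = (last_flat + 1, len(arrival_ns) - 1)
    return rising, constant, falling


# Hypothetical 16-point profile: rises over bits 0-5, flat over 6-14, drops at 15.
profile = [0.5, 0.8, 1.1, 1.4, 1.7, 2.0] + [2.3] * 9 + [2.0]
regions = split_regions(profile)  # ((0, 5), (6, 14), (15, 15))
```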

The performance comparisons of the 7 types of 16-bit adders for the 8 by 8 multiplier, in terms of output timing, are shown in Fig. 3(b) and Fig. 3(c). The area and power comparisons are shown in Fig. 3(d) and Fig. 3(e) respectively. These values are obtained from the post-layout simulation of each adder. In each of the 3 regions, we note that more than one adder performs well, with only slight differences between them. In the positive slope region of the comparison graph, the RCA and CSA are the fastest. The difference between the arrival times of successive points in the positive slope region is greater than the carry propagation time of one full adder of an RCA. For example, if a full adder takes 0.24 ns to propagate the carry, the arrival times of successive points in the positive slope region differ by more than 0.24 ns, so each carry is ready before the next input bits arrive. The RCA is therefore sufficient in this region. Since the CSA also requires more area and power, the RCA is the best choice for the positive slope region.
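The argument for the RCA in the positive slope region amounts to a simple check on the arrival profile, sketched below. The 0.24 ns carry delay is the example figure quoted in the text; the arrival times are invented for illustration and are not measurements from the paper.

```python
def rca_keeps_up(arrival_ns, carry_delay_ns):
    """True when every successive input arrives later than the previous one
    by more than one full adder carry delay, so a ripple carry is never
    the bottleneck in this span of bits."""
    pairs = zip(arrival_ns, arrival_ns[1:])
    return all(later - earlier > carry_delay_ns for earlier, later in pairs)


# Hypothetical rising profile with 0.30 ns steps: the ripple carry keeps up.
rising = [0.50, 0.80, 1.10, 1.40, 1.70, 2.00]
ok_rising = rca_keeps_up(rising, 0.24)

# In a constant region the inputs arrive together, so the check fails
# and a faster adder is needed.
ok_flat = rca_keeps_up([2.30, 2.30, 2.30], 0.24)
```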

In the second region, the arrival profile of the multiplier signals is relatively constant, so these bits wait for the carry input from the LSB side. A faster adder is thus needed in this region. The BCSA and BCLA are the fastest in this region, as shown in Fig. 3(b) and Fig. 3(c). Comparing the two, the BCLA is slightly faster than the BCSA, and its area and power are also lower. We conclude that for the 8 by 8 multiplier the hybrid adder should use the RCA for region 1 and the BCLA for region 2.

[Fig. 2 diagram: D flip-flops set the arrival profile delays from the partial products on the 2n-2 inputs of the final adder, whose outputs P feed D flip-flops clocked by CLK.]

Fig. 2. Arrival profile evaluation to the CPA for N by N multiplier

B. 16 by 16 Dadda multiplier

The input arrival profile to the final adder of the 16 by 16 Dadda multiplier is shown in Fig. 4(a). Here also the profile can be divided into 3 regions: the first region has a positive slope from point 0 to 6, the second region is constant from point 7 to 25, and the third region has a negative slope from point 26 to 31. Since the negative slope region has more than a few bits, we treat it as a separate region.

The performance comparisons of the 7 types of 32-bit adders for the 16 by 16 multiplier, in terms of output timing, are shown in Fig. 4(b) and Fig. 4(c). The area and power comparisons are shown in Fig. 4(d) and Fig. 4(e) respectively. In the positive slope region, the carry propagation time of one full adder is again smaller than the difference between the arrival times of successive points, so the RCA is again the best choice. In the second region, three adders are fast, with only slight differences: the CSLA, BCSA and BCSLA. The BCSA is faster than the CSLA and BCSLA in this region, but because of its larger area and power we choose between the CSLA and the BCSLA. Comparing these two, the CSLA is slightly faster than the BCSLA but requires more area and power. We therefore suggest that the BCSLA be used for the second region. In the third region, the BCSA and BCLA are the fastest. The BCLA requires less area and power than the BCSA, so we choose the BCLA for the entire third region.

Fig. 3. Final adder analysis for 8 by 8 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis. [Plots omitted: delay (ns) versus bit position in (a)-(c); area (um2) and power (mW) per adder type in (d) and (e).]

Fig. 4. Final adder analysis for 16 by 16 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis. [Plots omitted: delay (ns) versus bit position in (a)-(c); area (um2) and power (mW) per adder type in (d) and (e).]
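The selection reasoning used for the second region (set aside the adder that is fastest but costly, then prefer the cheaper of the remaining near-equal adders) can be formalized in several ways. The sketch below shows one possible reading: keep every adder whose delay is within a chosen slack of the fastest, then pick the one with the smallest area, using power as a tie-breaker. The slack value and all delay, area and power figures are invented for illustration; they are not results from the paper.

```python
def pick_adder(candidates, delay_slack=0.10):
    """Pick the lowest-area adder among those within `delay_slack`
    (as a fraction) of the fastest candidate's delay."""
    fastest = min(c["delay_ns"] for c in candidates.values())
    close = {
        name: c
        for name, c in candidates.items()
        if c["delay_ns"] <= fastest * (1 + delay_slack)
    }
    return min(close, key=lambda name: (close[name]["area_um2"],
                                        close[name]["power_mw"]))


# Hypothetical second-region figures: BCSA is fastest but largest,
# BCSLA is slightly slower than CSLA but smaller and lower in power.
second_region = {
    "CSLA": {"delay_ns": 3.1, "area_um2": 9000, "power_mw": 1.6},
    "BCSA": {"delay_ns": 3.0, "area_um2": 12000, "power_mw": 2.0},
    "BCSLA": {"delay_ns": 3.2, "area_um2": 7500, "power_mw": 1.4},
}
choice = pick_adder(second_region)  # "BCSLA"
```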

C. 32 by 32 Dadda multiplier

The input arrival profile to the final adder of the 32 by 32 Dadda multiplier is shown in Fig. 5(a). Here also the profile can be divided into 3 regions: the first region has a positive slope from point 0 to 15, the second region is constant from point 16 to 52, and the third region has a negative slope from point 53 to 63.

The performance comparisons of the 7 types of 64-bit adders for the 32 by 32 multiplier, in terms of output timing, are shown in Fig. 5(b) and Fig. 5(c). The area and power comparisons are shown in Fig. 5(d) and Fig. 5(e) respectively. In the positive slope region, the carry propagation time of one full adder is again smaller than the difference between the arrival times of successive points, so the RCA is again best suited to this region.

In the second region, four adders are fast, with only slight differences: the CLA, CSLA, BCSA and BCSLA. The CLA is faster only up to the middle of the second region (approximately 20 of its 40 bits), whereas the others are fast, with slight differences, over the entire second region. Since the CLA is faster over only half of the second region, we omit it here and choose among the remaining adders. The BCSA is faster than the CSLA and BCSLA in this region, but it requires more power and area, so the choice is between the CSLA and the BCSLA. As explained earlier, the BCSLA is slightly slower than the CSLA but saves area and power. We therefore suggest the BCSLA for the entire second region. In the third region, the BCSA and BCLA are fast, with only slight differences; the BCLA is the faster of the two and also requires less area and power than the BCSA. We therefore suggest that the BCLA be used for the third region.

D. 64 by 64 Dadda multiplier

The input arrival profile to the final adder of the 64 by 64 Dadda multiplier is shown in Fig. 6(a). Here also the profile can be divided into 3 regions: the first region has a positive slope from point 0 to 29, the second region is constant from point 30 to 112, and the third region has a negative slope from point 113 to 127.

The performance comparisons of the 7 types of 128-bit adders for the 64 by 64 multiplier, in terms of output timing, are shown in Fig. 6(b) and Fig. 6(c). The area and power comparisons are shown in Fig. 6(d) and Fig. 6(e) respectively. In the positive slope region, the carry propagation time of one full adder is again smaller than the difference between the arrival times of successive points, so the RCA is again the best choice. In the second region, three adders are fast, with only slight differences: the CSLA, BCSA and BCSLA. The CLA is faster only over the first quarter of the second region (approximately 20 of its 80 bits), and the BCLA is faster from the last quarter of the second region through the entire third region. Since the CLA and BCLA are effective over only parts of the second region, we omit these two adders here. The BCSA is faster than the CSLA and BCSLA in this region; however, because of the larger area and power requirement of the BCSA, we choose between the CSLA and the BCSLA. As explained earlier, the BCSLA is slightly slower than the CSLA but saves area and power, so we conclude that the BCSLA is suitable for this region.

In the third region, the BCSA and BCLA are fast, with slight differences; the BCLA is the faster of the two and also requires less area and power than the BCSA. We therefore choose the BCLA for the third region.

Fig. 5. Final adder analysis for 32 by 32 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis. [Plots omitted: delay (ns) versus bit position in (a)-(c); area (um2) and power (mW) per adder type in (d) and (e).]

Fig. 6. Final adder analysis for 64 by 64 Dadda multiplier: (a) arrival profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c) delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power analysis. [Plots omitted: delay (ns) versus bit position in (a)-(c); area (um2) and power (mW) per adder type in (d) and (e).]

IV. EFFECTIVE PARTITIONING OF MULTIPLIER REGIONS

From the analysis of multipliers of four operand sizes, we can propose simple equations to partition the 3 regions of the multiplier. Table I (exact) is derived from the exact arrival bit positions at the final adder. By shifting one or a few bits between regions, we can develop equations for the partitioning of the 3 regions. Table II (approx) shows the contents of Table I (exact) after this modification of the bit width of each region.

From Table I and Table II, we can conclude that for an n-bit multiplier (n must be >= 4), the first region has ~n/2 bits, the second region has ~$n + 2^x$ bits, where x = 0 for n from 4 to 7, x = 1 for n from 8 to 15, and so on (that is, $x = \lfloor \log_2 n \rfloor - 2$), and the third region has ~n/4 bits. These expressions for the widths of the three regions reduce the design time spent in finding the three regions of the final adder.
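The width expressions can be checked against Table II (approx) with a short sketch. It reads the middle width as n + 2^x with x = floor(log2 n) - 2, which is the reading that reproduces the Table II widths; the function name `region_widths` is a hypothetical helper.

```python
def region_widths(n):
    """Approximate widths (region 1, region 2, region 3) of the final
    adder for an n by n multiplier, following the expressions above."""
    if n < 4:
        raise ValueError("n must be at least 4")
    x = n.bit_length() - 3  # equals floor(log2(n)) - 2
    return n // 2, n + 2 ** x, n // 4


# The four multiplier sizes studied in the paper.
widths = {n: region_widths(n) for n in (8, 16, 32, 64)}
# {8: (4, 10, 2), 16: (8, 20, 4), 32: (16, 40, 8), 64: (32, 80, 16)}
```

For operand sizes that are powers of two the three widths add up to 2n, the full width of the final adder; for other sizes the expressions are only approximate.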

TABLE I
NUMBER OF BITS IN 3 REGIONS (EXACT)

Mul size    Region 1 bits    Region 2 bits    Region 3 bits
8           0-5              6-14             15
16          0-6              7-25             26-31
32          0-15             16-52            53-63
64          0-29             30-112           113-127

V. FINAL ADDER DESIGN

From the analysis of the 28 adders, the RCA is the best choice in the positive slope region. In the second region, three adders give good performance: the BCSA, BCSLA and BCLA. The overall structure is a hybrid adder, because a different adder gives the best performance in each region. Building a hybrid structure within a single region would make the layout more complex, so to avoid this complexity we use a single type of adder in each region. From the above analysis, the BCSLA gives a better balance of delay, area and power than the BCSA and BCLA, so the BCSLA is the best choice in the second region.
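For readers unfamiliar with BEC-based blocks, the following behavioral sketch shows the idea behind a single BEC carry select block: a ripple carry adder computes the result for a carry-in of 0, a Binary to Excess-1 converter adds one to obtain the result for a carry-in of 1, and a multiplexer picks between them. This is a software model written for illustration, not the gate-level design used in the paper, and the function names are hypothetical.

```python
def bec(value, width):
    """Binary to Excess-1 converter: add one, keeping `width` bits."""
    return (value + 1) & ((1 << width) - 1)


def bec_select_block(a, b, cin, width):
    """Behavioral model of one BEC-based carry select block.

    Returns (sum, carry_out) for the `width`-bit operands a and b.
    """
    mask = (1 << width) - 1
    partial = (a & mask) + (b & mask)   # RCA result for carry-in 0 (sum plus carry bit)
    plus_one = bec(partial, width + 1)  # BEC output: result for carry-in 1
    chosen = plus_one if cin else partial  # multiplexer driven by the carry-in
    return chosen & mask, chosen >> width


# 4-bit example: 9 + 7 + 1 = 17, i.e. sum 1 with carry-out 1.
result = bec_select_block(9, 7, 1, 4)  # (1, 1)
```

The point of the structure is that the second ripple carry adder of a conventional carry select block is replaced by the smaller add-one circuit.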

In the third region, both the BCSA and BCLA are fast. Because of the large area and power requirement of the BCSA, the BCLA is the best adder in this region, and we choose it for the third region. Thus we arrive at the optimal final adder structure for parallel multipliers, shown in Fig. 7. The variable block BCSLA and BCLA are designed based on the square-root method [15].
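Putting the region widths and the adder choices together, the bit ranges handled by each part of the hybrid adder can be listed as in the sketch below. It assumes n is a power of two and uses the approximate widths n/2, 5n/4 and n/4; the internal sizing of the variable blocks by the square-root method is not modeled, and `hybrid_cpa_plan` is a hypothetical name.

```python
def hybrid_cpa_plan(n):
    """List (adder type, lowest bit, highest bit) for the three regions
    of the final adder of an n by n multiplier (n a power of two, n >= 4)."""
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 4")
    widths = [("RCA", n // 2), ("BCSLA", n + n // 4), ("BCLA", n // 4)]
    plan, low = [], 0
    for adder, width in widths:
        plan.append((adder, low, low + width - 1))
        low += width
    return plan


plan_16 = hybrid_cpa_plan(16)
# [("RCA", 0, 7), ("BCSLA", 8, 27), ("BCLA", 28, 31)]
```

For n = 16 this matches the bit ranges 0-7, 8-27 and 28-31 of the approximate partition.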

VI. ASIC IMPLEMENTATION

The designs proposed in this paper have been developed in Verilog-HDL and synthesized in Cadence RTL Compiler using the typical libraries of the TSMC 0.18 um technology. The synthesized Verilog netlists and their respective design constraint files are imported into Cadence SoC Encounter, which generates the automated layout from standard cells through placement and routing [16]. Parasitic extraction is performed using Encounter's native RC extraction tool. The extracted parasitic RC (SPEF format) is back-annotated to the Common Timing Engine in the Encounter platform and analyzed for static timing delay. The power analysis is done using Virtuoso UltraSim [17].

TABLE II
NUMBER OF BITS IN 3 REGIONS (APPROX)

Mul size    Region 1 bits (width)    Region 2 bits (width)    Region 3 bits (width)
8           0-3 (4)                  4-13 (10)                14-15 (2)
16          0-7 (8)                  8-27 (20)                28-31 (4)
32          0-15 (16)                16-55 (40)               56-63 (8)
64          0-31 (32)                32-111 (80)              112-127 (16)

VII. CONCLUSION

In this work we have obtained simple equations for the bit widths of the three regions (positive slope, constant, negative slope) of the input arrival profile to the final CPA. Each region has been analyzed in terms of area, delay and power with various standard adders in order to suggest the structure of the optimal final adder for parallel multipliers. From the analysis results, the RCA, BCSLA and BCLA provide the optimal performance for the positive slope region (width n/2), the constant region (width 5n/4) and the negative slope region (width n/4) respectively.

ACKNOWLEDGMENT

This work was supported in part by the Integrated

Circuit Design Laboratories, VIT University, Vellore,

India.

[Fig. 7 diagram: inputs from the first region feed an RCA; inputs from the second region feed a variable block BCSLA (variable block RCA plus BEC with multiplexer); inputs from the third region feed a variable block BCLA (variable block CLA plus BEC with multiplexer). Each block is labeled with its region width.]

Fig. 7. Optimal final adder for all the three regions.

REFERENCES

[1] Vojin G. Oklobdzija, "Improving Multiplier Design by Using Improved Column Compression Tree and Optimized Final Adder in CMOS Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, June 1995.

[2] P. F. Stelling and V. G. Oklobdzija, "Design strategies for optimal hybrid final adders in a parallel multiplier," Journal of VLSI Signal Processing, vol. 14, no. 3, pp. 321-31, 1996.

[3] Wen-Chang Yeh and Chein-Wei Jen, "High-Speed Booth Encoded Parallel Multiplier Design," IEEE Transactions on Computers, vol. 49, no. 7, July 2000.

[4] B. Parhami, Computer Arithmetic, Algorithm and Hardware Design, Oxford University Press, New York, pp. 91-119, 2000.

[5] Jaehong Park, Hung C. Ngo, Joel A. Silberman and Sang H. Dhong, "470ps 64bit Parallel Binary Adder," Symposium on VLSI Circuits Digest of Technical Papers, pp. 192-193, 2000.

[6] Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square root carry-select adder for low power applications," IEEE International Symposium on Circuits and Systems, vol. 4, pp. 4082-4085, 2005.

[7] Jin-Fu Li, Jiunn-Der Yu, and Yu-Jen Huang, "A Design Methodology for Hybrid Carry-Lookahead/Carry-Select Adders with Reconfigurability," IEEE International Symposium on Circuits and Systems (ISCAS 2005), 23-26 May 2005.

[8] R. P. P. Singh, Parveen Kumar and Balwinder Singh, "Performance Analysis Of Fast Adders Using VHDL," IEEE International Conference on Advances in Recent Technologies in Communication and Computing, pp. 189-193, 2009.

[9] B. Ramkumar and Harish M Kittur, "Low Power and Area Efficient Carry Select Adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, accepted for publication, DOI: 10.1109/TVLSI.2010.2101621.

[10] B. Ramkumar, Harish M Kittur and P. Mahesh Kannan, "ASIC Implementation of Modified Faster Carry Save Adder," European Journal of Scientific Research, vol. 42, issue 1, 2010.

[11] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, pp. 14-17, 1964.

[12] Luigi Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34, pp. 349-356, August 1965.

[13] K. C. Bickerstaff, E. E. Swartzlander and M. J. Schulte, "Analysis of column compression multipliers," Proceedings of the 15th IEEE Symposium on Computer Arithmetic, 2001.

[14] Whitney J. Townsend, Earl E. Swartzlander Jr. and Jacob A. Abraham, "A comparison of Dadda and Wallace multiplier delays," Advanced Signal Processing Algorithms, Architectures, and Implementations, Proceedings of the SPIE, vol. 5205, pp. 552-560, 2003.

[15] Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square root carry-select adder for low power applications," in Proc. IEEE Int. Symp. Circuits Syst., 2005, vol. 4, pp. 4082-4085.

[16] Cadence, "Encounter user guide," Version 6.2.4, March 2008.

[17] Virtuoso® Ultrasim User Guide, Version 6.1, November 2006.
