Optimal Final Carry Propagate Adder Design for Parallel Multipliers

Abstract

Based on ASIC layout-level simulation of seven types of adder structures, each in four different sizes (28 adders in total), we propose expressions for the width of each of the three regions of the final Carry Propagate Adder (CPA) used in parallel multipliers. We also propose the type of adder to use in each region so that the hybrid final adder of a parallel multiplier achieves optimal performance. This work evaluates the analyzed designs in terms of delay, area, and power through custom design and layout in a 0.18 µm CMOS process technology.
Ramkumar B and Harish M Kittur
Index terms – ASIC (Application Specific Integrated
Circuit), optimal hybrid CPA, Parallel multiplier, low
power, area efficient.
I. INTRODUCTION
The critical signal path in a parallel multiplier is
divided into three domains: AND gate array, PPST
(Partial Product Summation Tree) and the final CPA. The
delay introduced by the AND gate is relatively small
compared to the other two components, especially for the
large size multiplier. This delay component is also
relatively independent of the size of the multiplier. The
delay introduced by the PPST and the final CPA
constitutes a dominant component of the delay in the
multiplier [1].
Hybrid CPAs have been proposed earlier, with detailed
investigations on the final addition of parallel multipliers
[1]-[3]. It is well known that the signals applied to the
inputs of the CPA arrive first at the two ends of the CPA
and last in the middle of the CPA. Determining the exact
arrival times at the final adder is therefore of prime
importance in the design of the optimal final adder.
We have analyzed the arrival times from the PPST through
layout implementation and, based on those arrival times,
applied the inputs to 7 types of adders in 4 different bit
sizes (a total of 28 adders). The analysis is done using
industry standard tools, and the optimal final adder
structure is designed from the post-layout simulation
results. The investigation includes 8 by 8, 16 by 16,
32 by 32 and 64 by 64 Dadda multipliers, with the final
adders being 16, 32, 64 and 128-bit Ripple Carry Adder
(RCA), Carry Save Adder (CSA), Carry Select Adder (CSLA),
Carry Look Ahead Adder (CLA), and three BEC (Binary to
Excess-1 Converter) based adders, referred to here as BEC
Carry Select Adder (BCSLA), BEC Carry Save Adder (BCSA)
and BEC Carry Look Ahead Adder (BCLA) [4]-[10].
This paper is structured as follows. Section II deals
with the design of the PPST based on the Dadda algorithm
and the analysis of the signal arrival profile from the PPST.
The performance of the various adders in terms of area,
delay and power is analyzed in Section III. The equations
for efficient partitioning of the multiplier regions are
developed in Section IV. The final adder design and the
ASIC implementation details are provided in Sections V
and VI, respectively. Finally, the work is concluded in
Section VII.
II. ANALYSIS OF PPST SIGNAL ARRIVAL PROFILE
The basic top-level implementation for N by N
unsigned parallel multiplier without CPA is shown in Fig.
1. To analyze the exact arrival time from the PPST, the
multiplier is implemented without CPA.
Signal Buffering
In order to determine the typical signal arrival profile
and drive strengths, D flip-flops are used on the primary
inputs and outputs. The D flip-flops drive multiple buffers
to distribute the input signals to the N^2 AND gates. Delay
simulations were performed for each cell library to
determine:
a) The maximum number of buffers that a single D flip-flop
can drive.
b) The maximum number of AND gate inputs that a single
buffer can drive.
Fig. 1. Top-level implementation of N by N multiplier without CPA.
[Block diagram: D flip-flops on the N-bit multiplicand (A) and multiplier (B) inputs, buffers, AND gate array (P0), partial product reduction stages (P1 ... Ps), and output D flip-flops with load capacitors, all clocked by CLK.]

This work was supported in part by the Integrated Circuit Design
Laboratories, VIT University, Vellore, India.
B. Ramkumar and Harish Kittur are with the VLSI division, School
of Electronics Engineering, VIT University, Vellore, India (email:
ramkumar.b@vit.ac.in; kittur@vit.ac.in).
Partial Product Reduction Tree
The Wallace and Dadda methods are the popular partial
product reduction algorithms for fast multipliers [11]-[12],
but a closer examination of the delays and area of these
two multipliers shows that the Dadda multiplier is slightly
faster and more area efficient than the Wallace multiplier
[13]-[14].
For the reduction of the N by N partial product matrix,
Dadda proposes a sequence of matrix heights that are
determined by working back from the final two-row matrix.
In order to use the minimum number of reduction stages,
the height of each intermediate matrix is limited to the
largest integer that is no more than 1.5 times the height of
its successor. In this paper, the PPST is implemented
based on the Dadda algorithm.
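The height rule above can be sketched in a few lines of Python. This is only an illustration of the sequence described in the text, not the authors' implementation; the function name `dadda_stage_heights` is a hypothetical helper introduced here.

```python
def dadda_stage_heights(n):
    """Target matrix heights for reducing an n-row partial product
    matrix, listed from the first reduction stage to the last."""
    heights = []
    h = 2  # work back from the final two-row matrix
    while h < n:
        heights.append(h)
        h = (3 * h) // 2  # largest integer no more than 1.5 * h
    return heights[::-1]


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        stages = dadda_stage_heights(n)
        print(f"{n} by {n}: {len(stages)} stages, heights {stages}")
```

For the four multiplier sizes studied here, the sketch yields 4, 6, 8 and 10 reduction stages respectively.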
III. FINAL ADDER ANALYSIS
Once the partial product matrix has been reduced to a
height of two, the final stage CPA length is determined in
the Dadda approach as below,
CPA length = 2N – 2
Fig. 1 details the connection of the AND gate array (P0)
and the compression stages (P1 to Ps) to D flip-flops and
capacitive loads, from which the arrival times from the
inputs to the final CPA are found. The evaluated arrival
profile is then fixed as the input delay of the CPA using a
timing constraint file, as shown in Fig. 2.
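As an illustration of how a measured arrival profile could be turned into timing constraints, the following Python sketch emits one `set_input_delay` line (the standard SDC command) per adder input bit. The paper does not list its constraint file, so the port names `a` and `b`, the helper name `sdc_input_delays`, and the sample delay values are all assumptions made for this example; only the clock name CLK appears in Fig. 1.

```python
def sdc_input_delays(arrival_ns, clock="CLK", ports=("a", "b")):
    """Build SDC set_input_delay lines, one per operand bit.

    arrival_ns[i] is the arrival time (ns) of bit i from the PPST.
    Both adder operands are assumed to share the same arrival time.
    """
    lines = []
    for bit, delay in enumerate(arrival_ns):
        for port in ports:
            lines.append(
                f"set_input_delay -clock {clock} {delay:.2f} "
                f"[get_ports {{{port}[{bit}]}}]"
            )
    return lines


if __name__ == "__main__":
    # Hypothetical arrival times, not values measured in the paper.
    for line in sdc_input_delays([0.50, 1.25, 1.90]):
        print(line)
```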
A. 8 by 8 Dadda multiplier
The input arrival profile to the final CPA of the 8 by 8
Dadda multiplier is shown in Fig. 3(a). Based on the arrival
profile, the CPA is divided into 3 regions: the first
region has a positive slope from point 0 to 5, the second
region is constant from point 6 to 14, and the third region
has a negative slope and consists only of point 15. Since
the negative slope region has only one point, we include
this region within the second region.
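One possible way to automate this three-way split is sketched below. It treats every bit whose arrival time is within a tolerance of the latest arrival as part of the constant region; this tolerance rule, the name `split_regions`, and the sample profile are assumptions for illustration, since the paper reads the regions off the plotted profile.

```python
def split_regions(arrival_ns, tol=0.05):
    """Split an arrival profile into (rising, constant, falling) bit ranges.

    Bits whose arrival time is within `tol` ns of the latest arrival
    form the constant region; bits before it are the positive slope
    region and bits after it are the negative slope region.
    """
    peak = max(arrival_ns)
    flat = [i for i, t in enumerate(arrival_ns) if peak - t <= tol]
    first, last = flat[0], flat[-1]
    return (
        range(0, first),
        range(first, last + 1),
        range(last + 1, len(arrival_ns)),
    )


if __name__ == "__main__":
    # Hypothetical 16-point profile shaped like the 8 by 8 case:
    # rising over points 0-5, flat over 6-14, dropping at point 15.
    profile = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0] + [3.5] * 9 + [3.0]
    rising, constant, falling = split_regions(profile)
    print(list(rising), list(constant), list(falling))
```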
The output timing of the 7 types of 16-bit adders for the
8 by 8 multiplier is compared in Fig. 3(b) and Fig. 3(c).
The area and power comparisons are shown in Fig. 3(d) and
Fig. 3(e), respectively. These values are obtained from the
post-layout simulation of each adder. In each of the 3
regions, more than one adder performs well, with only
slight differences. In the positive slope region of the
comparison graph, the RCA and CSA are the fastest. The
difference between the arrival times of successive points
on the positive slope is greater than the carry propagation
time of one full adder of an RCA. For example, if a full
adder takes 0.24 ns to propagate the carry, the difference
between the arrival times of successive points on the
positive slope is more than 0.24 ns, so the carry from the
previous bit is ready before the next inputs arrive. The
RCA is therefore sufficient in this region. Also, because
the CSA requires more area and power, the RCA is the best
choice for the positive slope region.
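The argument can be checked numerically with a small sketch. The 0.24 ns full-adder carry delay is the example figure quoted above; the arrival times and the helper names `rca_keeps_up` and `rca_carry_ready` are hypothetical and only illustrate the reasoning.

```python
def rca_keeps_up(arrival_ns, fa_carry_ns=0.24):
    """True if every gap between successive arrival times exceeds
    the carry propagation time of one full adder."""
    gaps = [later - earlier
            for earlier, later in zip(arrival_ns, arrival_ns[1:])]
    return all(gap > fa_carry_ns for gap in gaps)


def rca_carry_ready(arrival_ns, fa_carry_ns=0.24):
    """Time at which each full adder's carry-out is ready, assuming a
    carry needs fa_carry_ns after both its inputs and carry-in arrive."""
    ready, times = 0.0, []
    for t in arrival_ns:
        ready = max(ready, t) + fa_carry_ns
        times.append(ready)
    return times


if __name__ == "__main__":
    # Hypothetical positive-slope arrival times (ns), 0.5 ns apart.
    rising = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    print(rca_keeps_up(rising))
    print(rca_carry_ready(rising))
```

When `rca_keeps_up` holds, each carry-out is ready exactly one full-adder delay after that bit's inputs arrive, so a faster adder would not shorten the critical path in this region.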
In the second region the arrival times of the multiplier
signals are relatively constant, so these bits wait for the
carry input from the LSB side. A faster adder is therefore
needed in this region. The BCSA and BCLA are the fastest in
this region, as shown in Fig. 3(b). Comparing the two, the
BCLA is slightly faster than the BCSA, and the area and
power of the BCLA are also lower than those of the BCSA. We
therefore conclude that for the 8 by 8 multiplier the
hybrid adder should use the RCA for region 1 and the BCLA
for region 2.

Fig. 2. Arrival profile evaluation to the CPA for N by N multiplier.
B. 16 by 16 Dadda multiplier
The input arrival profile for the final adder of the 16 by
16 Dadda multiplier is shown in Fig. 4(a). Based on the
arrival profile, the CPA is again divided into 3 regions:
the first region has a positive slope from point 0 to 6,
the second region is constant from point 7 to 25, and the
third region has a negative slope from point 26 to 31.
Since the negative slope region has more than a few bits,
we treat it as a separate region.
The output timing of the 7 types of 32-bit adders for the
16 by 16 multiplier is compared in Fig. 4(b) and Fig. 4(c).
The area and power comparisons are shown in Fig. 4(d) and
Fig. 4(e), respectively. In the positive slope region, the
carry propagation time of one full adder is again smaller
than the difference between the arrival times of successive
points. Thus the RCA is again the best choice in the
positive slope region. In the second region, three adders
are the fastest, with only slight differences: the CSLA,
BCSA and BCSLA. The BCSA is faster than the CSLA and BCSLA
in this region, but because of the larger area and power of
the BCSA we choose between the CSLA and BCSLA for the
second region. Comparing these two adders, the CSLA is
slightly faster than the BCSLA but requires more area and
power. We therefore suggest that the BCSLA be used for the
second region. In the third region, the BCSA and BCLA are
the fastest. The BCLA saves more area and power than the
BCSA; we therefore choose the BCLA for the entire third
region.

Fig. 3. Final adder analysis for 8 by 8 Dadda multiplier: (a) arrival
profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c)
delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power
analysis.

Fig. 4. Final adder analysis for 16 by 16 Dadda multiplier: (a) arrival
profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c)
delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power
analysis.
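The selection reasoning used in each region (keep the adders that are nearly as fast as the fastest one, then prefer the one that costs least in area and power) can be written as a small rule. The sketch below is an assumption about how to formalize that reasoning: the delay tolerance, the use of the area-power product as the tie-breaker, the name `pick_adder`, and every number in `CANDIDATES` are hypothetical placeholders, not results from the paper.

```python
# Hypothetical figures for illustration only (not measured values).
CANDIDATES = {
    "RCA":   {"delay_ns": 9.0, "area_um2": 3000, "power_mw": 0.6},
    "CSLA":  {"delay_ns": 5.2, "area_um2": 8000, "power_mw": 1.4},
    "BCSA":  {"delay_ns": 5.0, "area_um2": 9000, "power_mw": 1.6},
    "BCSLA": {"delay_ns": 5.3, "area_um2": 6500, "power_mw": 1.1},
}


def pick_adder(candidates, delay_tol_ns):
    """Among adders within delay_tol_ns of the fastest, return the
    name of the one with the smallest area-power product."""
    fastest = min(c["delay_ns"] for c in candidates.values())
    near = {
        name: c
        for name, c in candidates.items()
        if c["delay_ns"] - fastest <= delay_tol_ns
    }
    return min(near, key=lambda n: near[n]["area_um2"] * near[n]["power_mw"])


if __name__ == "__main__":
    print(pick_adder(CANDIDATES, delay_tol_ns=0.5))
```

With a zero tolerance the rule reduces to picking the fastest adder; widening the tolerance trades a small delay penalty for lower area and power, which is the trade-off argued for in the text.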
C. 32 by 32 Dadda multiplier
The input arrival profile to the final adder of the 32 by
32 Dadda multiplier is shown in Fig. 5(a). Based on the
arrival profile, the CPA is again divided into 3 regions:
the first region has a positive slope from point 0 to 15,
the second region is constant from point 16 to 52, and the
third region has a negative slope from point 53 to 63.
The output timing of the 7 types of 64-bit adders for the
32 by 32 multiplier is compared in Fig. 5(b) and Fig. 5(c).
The area and power comparisons are shown in Fig. 5(d) and
Fig. 5(e), respectively. In the positive slope region, the
carry propagation time of one full adder is again smaller
than the difference between the arrival times of successive
points. Thus the RCA is again best suited to the positive
slope region.
In the second region, four adders are fast, with only
slight differences: the CLA, CSLA, BCSA and BCSLA. The CLA
is faster only up to the middle of the second region
(approximately 20 of its 40 bits), whereas the others are
fast over the entire second region. Since the CLA is faster
over only half of the second region, we omit it here and
choose among the remaining adders. The BCSA is faster than
the CSLA and BCSLA in this region, but it requires more
power and area, so the choice is between the CSLA and
BCSLA. As explained earlier, the BCSLA is slightly slower
than the CSLA, but it saves more area and power. We
therefore suggest the BCSLA for the entire region 2. In the
third region, the BCSA and BCLA are the fastest, with only
slight differences. The BCLA is the faster of the two in
this region and also requires less area and power than the
BCSA. We therefore suggest that the BCLA be used for the
third region.
D. 64 by 64 Dadda multiplier
The input arrival profile to the final adder of the 64 by
64 Dadda multiplier is shown in Fig. 6(a). Based on the
arrival profile, the CPA is again divided into 3 regions:
the first region has a positive slope from point 0 to 29,
the second region is constant from point 30 to 112, and the
third region has a negative slope from point 113 to 127.
The output timing of the 7 types of 128-bit adders for the
64 by 64 multiplier is compared in Fig. 6(b) and Fig. 6(c).
The area and power comparisons are shown in Fig. 6(d) and
Fig. 6(e), respectively. In the positive slope region, the
carry propagation time of one full adder is again smaller
than the difference between the arrival times of successive
points. Thus the RCA is again the best choice in the
positive slope region. In the second region, three adders
are fast, with only slight differences: the CSLA, BCSA and
BCSLA. The CLA is faster only over the first quarter of the
second region (approximately 20 of its 80 bits). The BCLA
is faster from the last quarter of the second region
through the entire third region. Since the CLA and BCLA are
effective over only parts of the second region, we omit
these two adders in this region. The BCSA is faster than
the CSLA and BCSLA in this region. However, because of the
larger area and power requirement of the BCSA, we choose
between the CSLA and BCSLA. As explained earlier, the BCSLA
is slightly slower than the CSLA but it saves more area and
power. We therefore conclude that the BCSLA is suitable for
this region.
In the third region, the BCSA and BCLA are the fastest,
with only a slight difference. The BCLA is the faster of
the two in this region and also requires less area and
power than the BCSA. We therefore choose the BCLA for the
third region.

Fig. 5. Final adder analysis for 32 by 32 Dadda multiplier: (a) arrival
profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c)
delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power
analysis.

Fig. 6. Final adder analysis for 64 by 64 Dadda multiplier: (a) arrival
profile of PPST, (b) delay analysis of RCA, CSA, CSLA and CLA, (c)
delay analysis of BCSA, BCSLA and BCLA, (d) area analysis, (e) power
analysis.
IV. EFFECTIVE PARTITIONING OF MULTIPLIER REGIONS
From the analysis of the four multiplier operand sizes, we
can propose some simple equations to partition the 3
regions of the multiplier. Table I (exact) is derived from
the exact arrival bit positions at the final adder. By
shifting one or a few bits between regions, we can develop
equations for partitioning the 3 regions. Table II (approx)
shows the contents of Table I (exact) after this
modification of the bit width of each region.
From Table I and Table II, we conclude that for an n-bit
multiplier (n must be >= 4), the first region has ~n/2
bits, the second region has ~(n + 2^x) bits, where x = 0
for n from 4 to 7, x = 1 for n from 8 to 15, etc. (each
value of x covers the sizes from a power of two up to one
less than the next power of two), and the third region has
~n/4 bits. These expressions for the width of the three
regions reduce the design time spent in finding the three
regions of the final adder.
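The three width expressions can be evaluated directly. The sketch below is one reading of them in Python; the function names are hypothetical, and the use of integer (floor) division for sizes that are not multiples of 4 is an assumption, since the text only gives approximate widths.

```python
def region_widths(n):
    """Approximate widths (region 1, region 2, region 3) of the final
    adder for an n by n multiplier: n/2, n + 2^x and n/4."""
    if n < 4:
        raise ValueError("the expressions are stated for n >= 4")
    x = n.bit_length() - 3  # x = 0 for 4..7, 1 for 8..15, 2 for 16..31
    return n // 2, n + 2 ** x, n // 4


def region_bit_ranges(n):
    """Inclusive (low, high) bit positions of the three regions."""
    w1, w2, w3 = region_widths(n)
    return (0, w1 - 1), (w1, w1 + w2 - 1), (w1 + w2, w1 + w2 + w3 - 1)


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, region_widths(n), region_bit_ranges(n))
```

For n = 8, 16, 32 and 64 the widths sum to 2n and the bit ranges agree with the rows of Table II; for sizes that are not powers of two the widths are only approximate, as the "~" in the text indicates.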
TABLE I
NUMBER OF BITS IN 3 REGIONS (EXACT)

Mul size   Region 1 bit width   Region 2 bit width   Region 3 bit width
8          0-5                  6-14                 15
16         0-6                  7-25                 26-31
32         0-15                 16-52                53-63
64         0-29                 30-112               113-127
V. FINAL ADDER DESIGN
From the analysis of the 28 adders, the RCA is the best
choice in the positive slope region. In the second region,
three adders give good performance: the BCSA, BCSLA and
BCLA. The overall structure is a hybrid adder, because each
region uses a different type of adder to obtain better
performance. Making a single region itself hybrid would
make the layout more complex, so to avoid this complexity
we use one type of adder structure within each region. From
the above analysis, the BCSLA gives better overall
performance than the BCSA and BCLA, so the BCSLA is the
best choice in the second region.
In the third region, both the BCSA and BCLA are fast.
Because of the large area and power requirement of the
BCSA, the BCLA is the best adder in this region, and we
choose the BCLA for the third region. Thus we arrive at the
optimal final adder structure for parallel multipliers, as
shown in Fig. 7. The variable block BCSLA and BCLA are
designed based on the square-root method [15].
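A behavioural sketch of this three-region structure is given below in Python. It models function only, not timing or gate-level structure: each BEC region is treated as a single block (the design in the paper uses square-root variable blocks), the BCSLA and BCLA regions are modelled identically because they differ only in how the carry-in-0 sum is produced, and n is assumed to be a power of two so that the three widths total 2n. All function names are hypothetical.

```python
def ripple_carry_add(a, b, width, carry_in=0):
    """Add two width-bit operands one full adder at a time."""
    total, carry = 0, carry_in
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def bec_select_add(a, b, width, carry_in):
    """BEC-style select: compute the sum for carry-in 0, form that
    result plus one (the excess-1 value), then pick one of the two
    with the real carry-in acting as the multiplexer select."""
    s0, c0 = ripple_carry_add(a, b, width, 0)
    base = (c0 << width) | s0      # (width + 1)-bit result for carry-in 0
    plus_one = base + 1            # value the BEC would produce
    chosen = plus_one if carry_in else base
    return chosen & ((1 << width) - 1), chosen >> width


def hybrid_add(a, b, n):
    """Add two 2n-bit operands using region widths n/2, n + 2^x, n/4."""
    x = n.bit_length() - 3
    widths = (n // 2, n + 2 ** x, n // 4)
    result, carry, shift = 0, 0, 0
    for region, width in enumerate(widths):
        mask = (1 << width) - 1
        ra, rb = (a >> shift) & mask, (b >> shift) & mask
        if region == 0:
            s, carry = ripple_carry_add(ra, rb, width, carry)  # RCA region
        else:
            s, carry = bec_select_add(ra, rb, width, carry)    # BEC regions
        result |= s << shift
        shift += width
    return result, carry


if __name__ == "__main__":
    print(hybrid_add(0xFFFF, 0x0001, 8))
    print(hybrid_add(0x1234, 0x0FF1, 8))
```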
VI. ASIC IMPLEMENTATION
The designs proposed in this paper have been developed in
Verilog HDL and synthesized in Cadence RTL Compiler using
the typical libraries of the TSMC 0.18 µm technology. The
synthesized Verilog netlists and their respective design
constraint files are imported into Cadence SoC Encounter,
where the automated layout is generated from standard cells
through placement and routing [16]. Parasitic extraction is
performed using Encounter's native RC extraction tool. The
extracted parasitic RC (in SPEF format) is back-annotated
to the Common Timing Engine in the Encounter platform and
analyzed for static timing delay. The power analysis is
done using Virtuoso UltraSim [17].
TABLE II
NUMBER OF BITS IN 3 REGIONS (APPROX)

Mul size   Region 1 bit width   Region 2 bit width   Region 3 bit width
8          0-3 = 4              4-13 = 10            14-15 = 2
16         0-7 = 8              8-27 = 20            28-31 = 4
32         0-15 = 16            16-55 = 40           56-63 = 8
64         0-31 = 32            32-111 = 80          112-127 = 16
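To see how far the approximate partition moves each boundary, the bit ranges of Table I and Table II can be compared directly. The short Python sketch below uses only the ranges listed in the two tables; the names `EXACT`, `APPROX`, `widths` and `width_shift` are hypothetical helpers.

```python
# Inclusive bit ranges (region 1, region 2, region 3) from Table I.
EXACT = {
    8:  ((0, 5),  (6, 14),   (15, 15)),
    16: ((0, 6),  (7, 25),   (26, 31)),
    32: ((0, 15), (16, 52),  (53, 63)),
    64: ((0, 29), (30, 112), (113, 127)),
}

# Inclusive bit ranges from Table II.
APPROX = {
    8:  ((0, 3),  (4, 13),   (14, 15)),
    16: ((0, 7),  (8, 27),   (28, 31)),
    32: ((0, 15), (16, 55),  (56, 63)),
    64: ((0, 31), (32, 111), (112, 127)),
}


def widths(ranges):
    """Number of bits in each inclusive (low, high) range."""
    return tuple(high - low + 1 for low, high in ranges)


def width_shift(n):
    """Approximate width minus exact width, per region."""
    return tuple(a - e for e, a in zip(widths(EXACT[n]), widths(APPROX[n])))


if __name__ == "__main__":
    for n in sorted(EXACT):
        print(n, widths(EXACT[n]), widths(APPROX[n]), width_shift(n))
```

Across the four sizes no region changes by more than three bits, and the total width always remains 2n.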
VII. CONCLUSION
In this work we have obtained simple equations for the bit
widths of the three regions (positive slope, constant,
negative slope) of the input arrival profile to the final
CPA, and each region has been analyzed in terms of area,
delay and power with various standard adders in order to
suggest the structure of the optimal final adder for
parallel multipliers. From the analysis results, the RCA,
BCSLA and BCLA provide the optimal performance for the
positive slope region (width n/2), the constant region
(width 5n/4) and the negative slope region (width n/4),
respectively.
ACKNOWLEDGMENT
This work was supported in part by the Integrated
Circuit Design Laboratories, VIT University, Vellore,
India.
Fig. 7. Optimal final adder for all the three regions.
[Block diagram: the inputs from the first region (width n/2) feed an RCA with carry-in 0; the inputs from the second region (width n + 2^x) feed a variable block BCSLA built from a variable block RCA and a BEC with multiplexer; the inputs from the third region (width n/4) feed a variable block BCLA built from a variable block CLA and a BEC with multiplexer. The output widths are n/2, n + 2^x and (n/4) + 1.]

REFERENCES
[1] Vojin G. Oklobdzija, "Improving Multiplier Design by Using
Improved Column Compression Tree and Optimized Final Adder
in CMOS Technology," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 3, no. 2, June 1995.
[2] P. F. Stelling and V. G. Oklobdzija, "Design strategies for optimal
hybrid final adders in a parallel multiplier," Journal of VLSI
Signal Processing, vol. 14, no. 3, pp. 321-31, 1996.
[3] Wen-Chang Yeh and Chein-Wei Jen, "High-Speed Booth
Encoded Parallel Multiplier Design," IEEE Transactions on
Computers, vol. 49, no. 7, July 2000.
[4] B. Parhami, Computer Arithmetic, Algorithm and Hardware
Design, Oxford University Press, New York, pp. 91-119, 2000.
[5] Jaehong Park, Hung C. Ngo, Joel A. Silberman and Sang H. Dhong,
"470ps 64bit Parallel Binary Adder," Symposium on VLSI
Circuits Digest of Technical Papers, pp. 192-193, 2000.
[6] Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square
root carry-select adder for low power applications," IEEE
International Symposium on Circuits and Systems, vol. 4, pp.
4082-4085, 2005.
[7] Jin-Fu Li, Jiunn-Der Yu, and Yu-Jen Huang, "A Design
Methodology for Hybrid Carry-Lookahead/Carry-Select Adders
with Reconfigurability," IEEE International Symposium on
Circuits and Systems (ISCAS 2005), 23-26 May 2005.
[8] R. P. P. Singh, Parveen Kumar and Balwinder Singh, "Performance
Analysis Of Fast Adders Using VHDL," IEEE International
Conference on Advances in Recent Technologies in
Communication and Computing, pp. 189-193, 2009.
[9] B. Ramkumar and Harish M Kittur, "Low Power and Area
Efficient Carry Select Adder," IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, accepted for publication,
DOI: 10.1109/TVLSI.2010.2101621.
[10] B. Ramkumar, Harish M Kittur and P. Mahesh Kannan, "ASIC
Implementation of Modified Faster Carry Save Adder," European
Journal of Scientific Research, vol. 42, issue 1, 2010.
[11] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE
Transactions on Electronic Computers, vol. EC-13, pp. 14-17,
1964.
[12] Luigi Dadda, "Some Schemes for Parallel Multipliers," Alta
Frequenza, vol. 34, pp. 349-356, August 1965.
[13] K. C. Bickerstaff, E. E. Swartzlander, M. J. Schulte, "Analysis of
column compression multipliers," Proceedings of 15th IEEE
Symposium on Computer Arithmetic, 2001.
[14] Whitney J. Townsend, Earl E. Swartzlander Jr. and Jacob A.
Abraham, "A comparison of Dadda and Wallace multiplier
delays," Advanced Signal Processing Algorithms,
Architectures, and Implementations, Proceedings of the SPIE,
vol. 5205, pp. 552-560, 2003.
[15] Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square
root carry-select adder for low power applications," in Proc. IEEE
Int. Symp. Circuits Syst., 2005, vol. 4, pp. 4082-4085.
[16] Cadence, "Encounter user guide," Version 6.2.4, March 2008.
[17] Virtuoso® Ultrasim User Guide, Version 6.1, November 2006.