Article

Reducing the Computation Time in (Short Bit-Width) Two's Complement Multipliers

Abstract

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results, based on a rough theoretical analysis and on logic synthesis, showed its efficiency in terms of both area and delay.
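
As an illustration of why a radix-4 Modified Booth Encoded array starts from n/2 rows, the following Python sketch recodes an n-bit two's complement multiplier into n/2 signed digits in {-2, ..., +2}. The function names are hypothetical, and the code models the arithmetic only; it is not the authors' circuit and does not implement their row-reduction technique.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier y (n even) into n/2
    radix-4 Booth digits, least significant digit first."""
    assert n % 2 == 0, "this sketch assumes an even bit-width"
    y &= (1 << n) - 1                                  # n-bit two's complement pattern
    bits = [0] + [(y >> i) & 1 for i in range(n)]      # bits[0] plays the role of y_{-1} = 0
    digits = []
    for i in range(0, n, 2):
        # Overlapping triplet (y_{i+1}, y_i, y_{i-1}) -> digit in {-2, -1, 0, 1, 2}
        digits.append(bits[i] + bits[i + 1] - 2 * bits[i + 2])
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 partial products digit * x * 4**k (arithmetic model only)."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(-7, 8), booth_multiply(13, -7, 8))
```

Each digit selects one partial product row, so an 8-bit multiplier yields four rows before any correction bits for negative digits are taken into account.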

... NBPR reduces the PP row height without computing Negative Encoding (NE) and Sign Extension (SE). NE and SE are common terms in existing bit pair recoding algorithms such as Booth [16][17][18]; they incur an additional PP row, which in turn costs more area, power, and delay. NBPR generates the PP based on the number of 1's in 4-bit recoded multiplier groups. ...
... Small-sized and high accuracy applications need 8×8 exact multiplication, whilst large bit-sized error-resilient applications use 16×8 or 16×16 inexact multiplication based on the requirement. A fixed-width multiplier is the most common multiplier used in image processing applications. ...
Article
Compensating the error using additional circuitry is mandatory in a low-error fixed-width multiplier. Instead of compensating the error, this paper proposes reconfiguring an n-bit fixed-width multiplier into an n/2-bit error-free full-width multiplier using decomposed multiplication. The decomposed block multiplication using an area-efficient New Bit Pair Recoding (NBPR) algorithm in fixed-width mode shows a lower truncation error than existing truncated multipliers. A reconfigurable 16x16 NBPR multiplier in three different modes (8x8, 16x8, 16x16) with a fixed 16-bit product is verified on the TSMC 65nm CMOS standard cell library. The experimental results show that the NBPR multiplier consumes less area than standard Booth multipliers. Evaluating the proposed multiplier in imaging shows improved PSNR with minimal error compared to other fixed-width multipliers.
... Digital multipliers are widely used in arithmetic units of microprocessors, multimedia and digital signal processors. Many algorithms and architectures have been proposed to design high-speed and low-power multipliers [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. A normal binary (NB) multiplication by digital circuits includes three steps. ...
... Booth encoding or a modified Booth encoding (MBE) is usually used in the partial product generator of parallel multipliers to reduce the number of partial product rows by half [5], [6], [10], [11], [12], [13]. An RBPP row can be obtained from two adjacent NB partial product rows by inverting one of the pair rows [5], [6]; an N-bit conventional RB MBE (CRBBE-2) multiplier requires $\lceil N/4\rceil$ RBPP rows. ...
... Each group is decoded by selecting the partial product shown in Table 1, where 2A indicates twice the multiplicand, which can be obtained by left shifting. Negation operation is achieved by inverting each bit of A and adding '1' (defined as correction bit) to the LSB [10], [11], [12], [13]. Methods have been proposed to solve the problem of correction bits for NB radix-4 Booth encoding (NBBE-2) multipliers. ...
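
The decoding described in this excerpt (select 0, A, or 2A, then negate by inverting every bit and adding a correction bit at the LSB) can be sketched in Python as follows. This is a minimal arithmetic model under stated assumptions: the function name and the fixed `width` parameter are hypothetical, and the special handling of the "negative zero" Booth code in real encoders is not modelled.

```python
def partial_product_row(a: int, digit: int, width: int) -> tuple[int, int]:
    """Return (row_bits, correction_bit) for a Booth digit in {-2, ..., 2}.

    row_bits is a width-bit pattern; adding correction_bit at the LSB
    completes the two's complement negation when the digit is negative.
    """
    mask = (1 << width) - 1
    # 2A is obtained by a one-bit left shift of the multiplicand.
    magnitude = {0: 0, 1: a, 2: a << 1}[abs(digit)] & mask
    if digit < 0:
        return (~magnitude) & mask, 1   # invert each bit, defer the '+1'
    return magnitude, 0


if __name__ == "__main__":
    row, neg = partial_product_row(5, -2, 8)
    print(format(row, "08b"), neg)
```

Keeping the "+1" as a separate correction bit avoids a carry-propagate addition inside the partial product generator; the bit is summed later together with the rest of the array.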
Article
Due to its high modularity and carry-free addition, a redundant binary (RB) representation can be used when designing high performance multipliers. The conventional RB multiplier requires an additional RB partial product (RBPP) row, because an error-correcting word (ECW) is generated by both the radix-4 Modified Booth encoding (MBE) and the RB encoding. This incurs an additional RBPP accumulation stage for the MBE multiplier. In this paper, a new RB modified partial product generator (RBMPPG) is proposed; it removes the extra ECW and hence saves one RBPP accumulation stage. Therefore, the proposed RBMPPG generates fewer partial product rows than a conventional RB MBE multiplier. Simulation results show that the proposed RBMPPG based designs significantly improve the area and power consumption when the word length of each operand in the multiplier is at least 32 bits; these reductions over previous NB multiplier designs incur a modest delay increase (approximately 5 percent). The power-delay product can be reduced by up to 59 percent using the proposed RB multipliers when compared with existing RB multipliers.
... We have implemented twin precision in an efficient manner with less hardware constraint compared to previous implementation, and a suitable algorithm is proposed. Implementing TP in RCMB [11] has achieved reduced area, delay and power compared to prior TP technique applied in MB algorithm. We have analysed that implementing TP technique in RCMB results in better performance, and it has been discussed in the rest of our paper. ...
... The RCMB algorithm [11] is the classic two's complement multiplier that uses a radix-4 MBE scheme. A partial product row in RCMB (Figure 7) consists of partial products (Pi,j) that are generated based on Booth encoding and decoding; negk signals (k = 0 to ((N/2) − 1)) that are added in the LSB position of each partial product row to generate the two's complement; and 1s in the leftmost part of the partial product rows that provide sign extension in two's complement representation. ...
... A partial product row in RCMB (Figure 7) consists of partial products (Pi,j) that are generated based on Booth encoding and decoding; negk signals (k = 0 to ((N/2) − 1)) that are added in the LSB position of each partial product row to generate the two's complement; and 1s in the leftmost part of the partial product rows that provide sign extension in two's complement representation. Figure 1 in [11] describes the gate level diagram of partial product generation. The maximum height of the partial product array in MB is (N/2 + 1). ...
Article
In this paper, we present the performance of the twin precision technique in a reduced computation modified Booth (RCMB) multiplier to achieve double throughput, and an algorithm is proposed. The twin precision technique is an efficient way to obtain double throughput in multipliers. We describe how to apply the twin precision technique to RCMB multipliers. Implementing twin precision in an RCMB multiplier requires fewer changes to the partial product array to obtain double throughput. Multiplexers usually do the signal selection for N and N/2 bit multiplication. In the RCMB multiplier, [N/2] + 1 partial product rows are reduced to N/2 rows. Implementing the twin precision technique in RCMB requires only about [N/2] + 3 multiplexers, which opens the way to optimizing the twin precision (TP) multiplier. We thereby achieve a reduction in multiplexer utilisation of about 40% to 50% (for N = 8 to 128) compared to the existing twin precision modified Booth multiplier. In the proposed optimised TP modified Booth multiplier, this reduction in multiplexers leads to an overall reduction in area, power and delay: about 5% to 18% in area, 5% to 20% in delay, and 8% to 32% in power for N = 8 to 128. The proposed optimised TP multiplier is implemented in FFT complex multiplication, taken as an application case study, and achieves better performance (area, delay and power) compared to the prior TP multiplier. All evaluations are made using Cadence RTL Compiler with the TSMC 180 nm library.
... The continuous refinement of the mostly-used design paradigm based on the modified Booth algorithm [2] combined with a reduction tree (carry-save-adder array, Dadda [3], HPM [4]) has reached saturation. In [5] and [6] only slight improvements are achieved. Both proposals reduce the partial product number from n/2+1 to n/2 using different circuit optimization techniques of the critical path. ...
... Instead of looking for more effective numeric bases, which is a hard mathematical task, our approach consists in exploiting already existing odd-multiple free recoding algorithms (2^1, 2^2, 2^5, and 2^8) to recursively build up generalized odd-multiple free radix-2^r recoding schemes. ...
... It outperforms β2^8 in all aspects (Fig. 8, 9, 10). The result summary with regard to the Dimitrov and Seidel algorithms is given in Table V. B. Radix-2^13 recoding: As β2^8 and β2^5 show good results for speed and power respectively, they have been merged (β2^13) for a better compromise. However, the Mux saving (130r) is not important enough compared to the Mux value (192r) of β2^8. ...
Article
In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (n/r), but also eliminates the need of pre-computing odd multiples of the multiplicand in higher radix (β ≥ 8) multiplication. Based on a mathematical proof that any higher radix β = 2^r can be recursively derived from a combination of two or a number of lower radices, a series of generalized radix β = 2^r multipliers are generated by means of primary radices: 2^1, 2^2, 2^5, and 2^8. A variety of higher-radix (2^3–2^32) two's complement 64×64 bit serial/parallel multipliers are implemented on Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r value varying from 2 to 64. Compared to the reference algorithm, improvements and savings of 8%, 35%, 39% are respectively obtained in terms of speed, power, and area. In addition, a new low-power and highly-flexible radix-2^r adapted technique for multi-precision multiplication is presented.
... Modified Booth encoding (MBE) [6] is a technique that has been introduced to reduce the number of PP rows, with a maximum height of [n/2] + 1 rows. More specifically, a two's complement multiplier [7] using radix-4 MBE generates a PP array with a maximum height of [n/2] rows without any increase of delay; each row of the PP array takes one of the following possible values: all zeros, ±X, ±2X [8]. This PP reduction may increase the speed of the multiplier. ...
... A similar study aimed at reducing the maximum height to [n/3], but using a different approach, has recently presented interesting results in [11]. Thus, in the following, we will evaluate and compare the proposed approach with the technique in [7]. The paper is organized as follows: in section II, the multiplication algorithm based on the radix-8 Booth recoding process is briefly reviewed and analyzed. ...
... Optimal pipelining, in fact, is a key issue in current and future multiplier (or multiplier-add) units: 1) the latency of the pipelined unit is very important, even for throughput oriented applications, as it impacts the energy consumption of the whole core, and 2) the placement of the pipelining flip-flops should at the same time minimize total power, due to the number of flip-flops required and the unbalanced signal propagation paths. The methods proposed in [1] and [2] were mostly focused on two's complement radix-4. ...
... Unsigned multiplication may produce a positive carry out during recoding (this depends on the value of n and the radix used for recoding), leading to one additional row, which increases the maximum height of the partial product array by one row, not just in one but in several columns. For all these reasons, extended techniques are presented in [1] and [2]. In this work, we present a technique that allows partial product arrays of maximum height n/m (with the goal of not increasing the delay of the partial product generation stage), for r > 4 and unsigned multipliers. ...
... In this case, short bit-width multipliers typically play the role of basic building blocks. Multipliers of moderate bit-width (less than 32 bits) are also being used massively in FPGAs [1]. All the above translates into high interest and motivation for the design of high-performance short or moderate bit-width multipliers. ...
... In the first step, the partial products are generated; in the second step, the partial products are reduced to one row of final sums and one row of final carries; and in the third step, the final sums and carries are added to generate the result. There has been abundant work on advanced multiplication algorithms and designs [1]-[14]. Most of the approaches utilize the Modified Booth Encoding approach [3] for the first step due to its ability to reduce the number of partial product rows by half. ...
... The continuous refinement of the mostly-used design paradigm based on the modified Booth algorithm [1] combined with a reduction tree (carry-save-adder array, Dadda, …) has reached saturation. In [2] only slight improvements are achieved. The proposal reduces the partial product number from N/2+1 to N/2 using different circuit optimization techniques of the critical path. ...
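
The three steps named in this excerpt can be sketched in Python as below. For brevity the sketch makes two simplifying assumptions that are not taken from the cited work: operands are unsigned, and the partial products come from a plain AND array rather than from Modified Booth Encoding. The function names are hypothetical, and the rows are reduced sequentially with 3:2 carry-save adders rather than in a balanced tree.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor on whole rows: a + b + c == sum_row + carry_row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def multiply_three_steps(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit operands in three explicit steps."""
    # Step 1: generate one partial product row per multiplier bit.
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    # Step 2: compress three rows into two until only a sum row and a carry row remain.
    while len(rows) > 2:
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    # Step 3: one final carry-propagate addition of the two remaining rows.
    return sum(rows)


if __name__ == "__main__":
    print(multiply_three_steps(200, 123, 8))
```

Halving the number of rows produced in step 1, as Modified Booth Encoding does, shortens the loop in step 2, which is where the row count directly affects the work done.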
... Theorem (1) and (2) allow an exponential reduction (1/2 ks and 1/2 k(s+t) , resp.) of the number of odd-multiples in equations (4) and (6) in comparison to equation (2), but at the expense of a linear augmentation (ks-1 and k(s+t)-1, resp.) in the number of additions. The advantage by far outweighs the cost, as practically shown in the next section. ...
Conference Paper
In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (N/r), but also eliminates the need of pre-computing odd multiples of the multiplicand in higher radix (r ≥ 3) multiplication. Based on a mathematical proof that any higher radix-2^r can be recursively derived from a combination of two or a number of lower radices, a series of generalized radix-2^r multipliers are generated by means of primary radices: 2^1, 2^2, 2^5, and 2^8. A variety of higher-radix (2^3–2^32) two's complement 64×64 bit serial/parallel multipliers are implemented on Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r value varying from 2 to 64. Compared to a recently published algorithm, savings of 21%, 53%, 105% are respectively obtained in terms of speed, power, and area.
... F. Lamberti et al. suggested the Modified Booth Encoding shown in Table 1 below for the commonly used radix-4 Booth multiplier [4]. The 2A in the top table represents a left shift of A by one bit. ...
Article
Multipliers play a vital role in DSP and neural network applications. Many methods have been introduced to design multipliers that offer high speed, lower power consumption and reduced area. The Booth algorithm provides an efficient way of performing signed binary multiplication. In this paper, the physical design of a 12-bit radix-8 Booth multiplier for signed multiplication is presented with the aim of improving performance metrics such as power, area and delay. The performance of the 12-bit radix-8 Booth multiplier is compared with that of a 64-bit radix-16 Booth multiplier.
... However, R8BR generates (n/3 + 1) PPs and it speeds up the PP reduction compared to R4BR, where R4BR generates (n/2 + 1) PPs for reduction. Efficient R8BR uses 'neg' term elimination and sign prevention circuits for less area, power carrying multiplication [1,10,15,16]. Additionally, approximated adders are used in the PP generation phase to reduce the delay. ...
Article
The delay owing to the generation of odd multiples (±3) in Radix-8 Booth recoding is minimized in this paper using a carry resist adder (CRA). The CRA is intentionally developed for performing the exact addition of ±1 and ±2 without carry propagation. The theoretical delay analysis shows that the 8-bit CRA reduces delay by 86.26% when compared to conventional Carry Propagate Addition (CPA) methods. Subsequently, relative comparisons of the CRA with various approximation-based recoding schemes show that the CRA consumes less area, power and critical path delay. Further, 8×8 and 16×16 signed binary multiplication using a CRA-based Radix-8 Booth recoder is developed and synthesized on the TSMC 65nm CMOS standard cell library. Also, the trade-off between area, power, delay and accuracy is verified for the proposed design using truncation. Finally, the CRA-based truncated Radix-8 Booth 8×8 multiplier is applied to color space conversion to quantify its amicability in imaging. The PSNR and MSE are used to evaluate the quality of the resultant image and show better performance than other existing approximated as well as truncated Radix-8 Booth multipliers.
... Goto et al. [2], for example, realized a regularly structured tree multiplier implemented using a 0.8μm CMOS process, focusing on layout density and multiplication time. Speed consideration is another example, given by Lamberti et al. [3], who introduced a way of reducing computation time in two's complement multipliers with short bit width. Similar work was also introduced by Dimitrov et al., who have developed area-efficient multipliers based on multiple-radix representations [4]. ...
Article
This paper proposes the design and implementation of a 16-bit multiplier based upon the Vedic mathematic approach, where the design has been targeted to a Xilinx Field Programmable Gate Array (FPGA) board, device XC5VLX30. The approach is different from a number of approaches that have been used to realize multipliers. It has been reported that previous algorithms such as Booth, Modified Booth, and Carry Save Multipliers are only suitable for improving speed or decreasing area utilization; therefore, those algorithms are not appropriate for designing multipliers that are used for digital signal processing (DSP) applications. Moreover, they are not flexible enough to be implemented on FPGAs or on a single chip using application specific integrated circuits (ASICs). The Vedic approach, on the other hand, can be used to design multipliers with optimum speed and less area utilization. In addition, it is reliable to implement on FPGAs or on a single chip. Behavioral and post-route simulation results show that the proposed multiplier performs better in terms of speed than the other reported multipliers when implemented on the FPGA. In terms of area utilization, better results are also obtained.
... Since the publication of Booth's algorithm in 1951, a huge number of improvement attempts were proposed, especially after the publication of a generalized version of MBA algorithm accompanied with its proof [29]. Most of the proposals aimed to reduce the number of partial products either by employing digital optimization techniques [30][31] [32] or by using larger slices (higher radices) [33]. However, experience showed [34] that beyond 4-bit slices (radix 8), the complexity to generate hard partial products can not be managed in a realistic way. ...
Article
ASIC or FPGA implementation of a finite word-length PID controller requires a double expertise: in control system and hardware design. In this paper, we only focus on the hardware side of the problem. We show how to design configurable fixed-point PIDs to satisfy applications requiring minimal power consumption, or high control-rate, or both together. As the multiply operation is the engine of the PID, we experimented with three algorithms: Booth, modified Booth, and a new recursive multi-bit multiplication algorithm. The latter enables the construction of finely grained PID structures with bit-level and unit-time precision. Such a feature makes it possible to tailor the PID to the desired performance and power budget. All PIDs are implemented at register-transfer level (RTL) as technology-independent reusable IP-cores. They are reconfigurable according to two compile-time constants: set-point word-length and latency. To make PID design easily reproducible, all necessary implementation details are provided and discussed.
... The couple (r,s) serves to partition the architecture so that maximum parallelism is exploited. As for area, our proposed architectures require as many hardware resources as modified Booth algorithm [13] with a critical path of N/2 [14][15] [16] [17]. For instance, a 64-bit two's complement finely pipelined multiplier requires a latency of seven clock cycles only (critical path composed of a series of 7 adders). ...
Article
This paper addresses the problem of multiplication with large operand sizes (N ≥ 32). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and reduces the hardware complexity of partial-product-generators as well. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands. As a result, the critical path is drastically reduced to $3\sqrt[3]{N/2}-3$ with no area overhead in comparison to the modified Booth algorithm, which shows a critical path of N/2 in adder stages. For instance, only 7 adder stages are needed for a 64-bit two's complement multiplier. Compared with reference algorithms for N = 64, important gain ratios of 1.62, 1.71, 2.64 are obtained in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.
... Since the publication of Booth's algorithm in 1951, a huge number of improvement attempts were proposed, especially after the publication of a generalized version of modified Booth algorithm accompanied with its proof [6]. Most of the proposals aimed to reduce the number of partial products either by employing digital optimization techniques [7][8] [9] or by using larger slices (higher radices) [10]. However, experience showed [11] that beyond 4-bit slices (radix 8), the complexity to generate hard partial products can not be managed in a realistic way. ...
Article
In embedded control applications, control-rate and energy-consumption are two critical design issues. This paper presents a series of high-speed and low-power finite-word-length PID controllers based on a new recursive multiplication algorithm. Compared to published results under the same conditions, savings of 431% and 20% are respectively obtained in terms of control-rate and dynamic power consumption. In addition, the new multiplication algorithm generates scalable PID structures that can be tailored to the desired performance and power budget. All PIDs are implemented at RTL level as technology-independent reusable IP-cores. They are reconfigurable according to two compile-time constants: set-point word-length and latency.
Article
The FFT function is one of the most important functions in digital signal processing, used in applications such as image processing, wireless communications and multimedia. FFT processors consist of butterfly structure operations involving addition, subtraction, and multiplication of complex values. In this work, the FFT butterfly structure is designed with "Vedic Multipliers" for high-speed applications. In this Vedic multiplier, an algorithm called "Urdhva Triyabhyam" is used to improve efficiency by optimizing the number of logic gates, constant inputs and garbage outputs. The data computation time is reduced by a 3-1-1-2 compressor using reversible logic gates. This reduces surplus power consumption by 11.24%, and the summation of the partial products is done with about 5.28% less delay. The area, power, delay, area delay product and power delay product are calculated using Cadence Virtuoso, and the design is implemented in the Spartan-6 device family using Xilinx ISE.
Article
Adders and multipliers are the fundamental elements of a signal processing architecture. Improving the speed of addition and multiplication operations while minimizing power consumption and area is the problem of interest in this paper. Two versions of modular hybrid adder structures are proposed. The adder structures are derived by merging improved carry skip, carry look ahead and ripple carry adder (RCA) concepts. The proposed adder version-1 has improved speed of operation while maintaining power consumption lower than that of the RCA. The proposed adder version-2 achieves a further improvement in speed through the addition of an incrementation scheme, at the cost of a slight increase in hardware complexity. Two versions of radix-4 Booth multipliers are proposed. Of the two versions, Booth multiplier version-1 has the highest speed and lowest power consumption, and version-2 has the lowest area compared to most of the existing architectures. Synthesis results show that the delay of the proposed multiplier version-1 is reduced by 20.74%, PDP by 45.62% and ADP by 32.26% in comparison with a typical low PDP 8*8 Booth multiplier, while consuming 31.4% less power and 14.59% less area. Cadence software with the gpdk 45 nm standard cell library is used for the design and implementation.
Article
The Fast Fourier Transform (FFT) is one of the most commonly used digital signal processing (DSP) functions in applications such as imaging, wireless communication, and multimedia. FFT processors consist of butterfly structure operations, which include multiplication, addition and subtraction of complex value data. In this paper, an FFT butterfly structure is designed using the Vedic multiplier for high speed applications. In this Vedic multiplier, the Urdhava Triyakbhyam algorithm is utilized to improve efficiency. A detector block is then introduced to identify the unwanted portion of the input data to be processed in the data processing unit. As a result, data computation time is reduced in the detector based Vedic multiplier, which supports full range and half range input data. The detector is developed based on a Boolean function to detect the valid ranges of the two input operands during input data computation. The detector result is used to select the operand with half range input data for Vedic multiplication, and it disables the surplus computation. This reduces the switching activity in the logic gates and proportionally reduces the power consumption. The proposed design-I consists of the Vedic algorithm and the detection unit. A 3-1-1-2 compressor is then designed and utilized in the multiplier. The proposed design-II is developed with a modified Vedic algorithm, the detection unit and the proposed 3-1-1-2 compressor. Finally, radix-2, radix-4, and radix-8 FFT butterflies are implemented using the detection unit based Vedic multiplier, the 3-1-1-2 compressor based multiplier and various existing multipliers. The proposed design-I and design-II are designed and implemented in Spartan-6, Virtex-4 and Virtex-5 FPGA family devices. The proposed reconfigurable Vedic multiplier is simulated and synthesized using Synopsys tools with the 90 nm standard cell library.
Thesis
This thesis addresses the problem of optimal hardware-realization of finite-word-length (FWL) linear controllers dedicated to MEMS applications. The biggest challenge is to ensure satisfactory control performances with a minimal hardware. To come up, two distinct but complementary optimizations can be undertaken: in control theory and in binary arithmetic. Only the latter is involved in this work. Because MEMS applications are targeted, the binary arithmetic must be fast enough to cope with the rapid dynamic of MEMS; power-efficient for an embedded control; highly scalable for an easy adjustment of the control performances; and easily predictable to provide a precise idea on the required logic resources before the implementation. The exploration of a number of binary arithmetics showed that radix-2^r is the best candidate that fits the aforementioned requirements. It has been fully exploited to designing efficient multiplier cores, which are the real engine of the linear systems. The radix-2^r arithmetic was applied to the hardware integration of two FWL structures: a linear time variant PID controller and a linear time invariant LQG controller with a Kalman filter. Both controllers showed a clear superiority over their existing counterparts, or in comparison to their initial forms.
Article
Proper closed-loop control has long been a prominent issue in the automotive industry. Industrial equipment governed by PID controllers has a very simple and efficient control architecture, but still suffers from large power consumption and slow mathematical computation. Many researchers have worked on, and are still trying to design, a low-power, low-delay PID. This paper reviews three MAC architectures with array, Booth and Wallace tree multipliers incorporated in a PID architecture. Simulations are performed, and the area, power and delay results are obtained by synthesis using Xilinx ISE. Comparisons are made between these three architectures in terms of power delay product and area delay product.
Article
Vedic mathematics is the ancient Indian method of mathematics based on 16 Sutras applicable to various branches of mathematics like trigonometry, calculus, geometry, conics etc. Multiplication is effectively used in modern communication and Digital Signal Processing applications. Ordinary multiplication requires propagation of carry from LSB to MSB while adding binary partial products, which limits the overall speed of multiplication. Vedic mathematics helps in generation of partial products and sums in one step, and ensures reduction in overall propagation delay. Urdhva Tiryakbhyam Sutra and Nikhilam Sutra are the two multiplication techniques used in Vedic mathematics. In this paper, an 8 * 8 Nikhilam Sutra multiplier for three different sets of bases is realized. The concepts of Urdhva Tiryakbhyam Sutra multiplication are used for the implementation of the proposed multiplier. The implementation results are compared with that of a Modified Booth’s multiplier in terms of delay, area and power. The design is synthesized in Synopsys Design Compiler using CMOS 90 nm technology, and results show that the proposed multiplier using Nikhilam Sutra with 25 bases is faster than the Modified Booth’s multiplier by 51.28%.
Article
In this paper, we describe an optimization for binary radix-16 (modified) Booth recoded multipliers to reduce the maximum height of the partial product columns to $\lceil n/4\rceil$ for $n=\text{64-bit}$ unsigned operands. This is in contrast to the conventional maximum height of $\lceil(n+1)/4\rceil$. Therefore, a reduction of one unit in the maximum height is achieved. This reduction may add flexibility during the design of the pipelined multiplier to meet the design goals; it may allow further optimizations of the partial product array reduction stage in terms of area/delay/power and/or may allow additional addends to be included in the partial product array without increasing the delay. The method can be extended to Booth recoded radix-8 multipliers, signed multipliers, combined signed/unsigned multipliers, and other values of $n$.
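
The two height formulas quoted in this abstract can be checked numerically with a short Python sketch. The helper names are hypothetical; the abstract states the reduced height only for n = 64 unsigned operands, so evaluating the second function for other n goes beyond what the abstract claims.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def radix16_conventional_height(n: int) -> int:
    """Conventional maximum column height: ceil((n + 1) / 4)."""
    return ceil_div(n + 1, 4)


def radix16_reduced_height(n: int) -> int:
    """Reduced maximum column height quoted for n = 64: ceil(n / 4)."""
    return ceil_div(n, 4)


if __name__ == "__main__":
    n = 64
    print(radix16_conventional_height(n), radix16_reduced_height(n))
```

For n = 64 the two expressions evaluate to 17 and 16, which is the one-unit reduction the abstract refers to.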
Conference Paper
This paper focuses on two's complement multipliers with short bit-widths, reducing their computation time without any increase in the delay of the partial product generation stage. This is done by reducing by one row the maximum height of the partial product array generated by a radix-4 Modified Booth multiplier; this reduction may allow for a faster compression of the partial product array and a regular layout. The method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers.
Conference Paper
High-speed multiplier designs have been a priority for multiplier-dominated applications such as wireless communications, computer applications, and image processing. In this paper, a high-performance fixed word length multiplier design is proposed that uses a recently proposed technique to eliminate the error correcting word and a delay-efficient parallel prefix Ling adder for the final redundant binary to normal binary (RB-NB) conversion. These techniques are selected to achieve a tradeoff among area, power and delay. Because of its carry-free addition and adaptability, the redundant binary (RB) representation has been chosen for the partial product summing tree of the high-performance multiplier design. This multiplier architecture is compared with the conventional redundant binary modified Booth encoding multiplier (CRBMBE) in terms of area, power and delay. The designed architecture shows improved performance over the conventional redundant binary multiplier in terms of area, delay and power-delay product (PDP).
Conference Paper
For multiplier-dominated applications such as digital signal processing, wireless communications, and computer applications, high-speed multiplier designs have always been a primary requisite. In this paper, a high-performance 64×64 bit redundant binary (RB) multiplier has been designed using a recently proposed redundant binary encoding approach to eliminate the error correcting word and a delay-efficient parallel prefix Ling adder for the final redundant binary to normal binary (RB-NB) conversion. Since the redundant binary (RB) representation allows carry-free addition and adaptability, it has been used in the 64×64 bit high-performance RB multiplier design for the summation of partial product terms. Eliminating the error correcting word also removes a redundant partial product accumulation stage, which improves the complexity and the critical path delay. The performance of the RB multiplier design is compared with that of the conventional RB modified Booth encoding multiplier (CRBMBE). The comparison is based on synthesis results obtained by synthesizing both multiplier architectures targeting a Xilinx FPGA, in terms of area and delay.
Article
This brief presents a hardware-efficient logarithm circuit design based on a novel discontinuous piecewise linear approximation method. Hardware synthesis results targeted for a commercial application specific integrated circuit cell library and field-programmable gate array show the practicality of the proposed design. A new figure of merit that combines error, area, time, and power is introduced and used to show that the proposed method provides the designer with useful design options when implementing logarithmic conversion.
Conference Paper
The conventional Modified Booth Encoding (MBE) generates n/2+1 rows instead of n/2 rows, and also an irregular partial product (PP) array, because of the extra neg bit (sign bit) at the least significant bit (LSB) position of each partial product row. In this work, a simple approach is proposed to generate n/2 partial product rows along with regular partial product arrays, thereby reducing the area and power of MBE multipliers [2]. A technique to find the direct 2's complement is applied to the last partial product row in order to reduce the number of partial product rows to n/2. The partial products are regularized by adding the LSB of each partial product row to its neg bit. In addition, different adders are used to generate the final result and are compared: a ripple carry adder, a carry lookahead adder and a carry select adder are used in the third and final step. The carry select adder shows a significant improvement in delay compared to the carry lookahead adder and the ripple carry adder.
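To make the row count concrete, here is a minimal sketch of standard radix-4 Booth recoding, written as generic textbook recoding and not as the specific circuit of this abstract. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical. An n-bit two's complement multiplier yields n/2 digits in {-2, -1, 0, 1, 2}; in hardware, each negative digit is typically realized by inverting the row and adding a neg bit, and the neg bit of the last row is what accounts for the extra (n/2+1)-th row mentioned above.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier (n even) into n/2 digits.

    Digit i equals y[2i-1] + y[2i] - 2*y[2i+1], with y[-1] taken as 0.
    """
    assert n % 2 == 0, "this sketch assumes an even bit width"
    y &= (1 << n) - 1                                  # view y as an n-bit pattern
    bits = [0] + [(y >> i) & 1 for i in range(n)]      # bits[0] plays the role of y[-1]
    digits = []
    for i in range(0, n, 2):
        low, mid, high = bits[i], bits[i + 1], bits[i + 2]
        digits.append(low + mid - 2 * high)
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 partial products d_i * x * 4**i (x is a signed Python int)."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))


digits = booth_radix4_digits(127, 8)
print(digits)                     # [-1, 0, 0, 2]
print(booth_multiply(-5, 127, 8)) # -635
```

Only the arithmetic is modelled here: sign extension, the neg bits and the compression tree are left out.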
Article
Proper closed loop has been an ever burning issue in many automotive industries. Industrial equipment governed by PID controllers has a simple control structure and is efficient, but it still suffers from large power consumption and slow mathematical computation. Many researchers have tried, and are still trying, to design a low-power, low-delay PID. This paper reviews three MAC architectures, with array, Booth and Wallace tree multipliers, which are in turn incorporated in a PID architecture. The simulations are done in ModelSim and the power results are synthesized using Xilinx ISE. The results suggest that the Wallace tree based MAC unit consumes less power and area.
Conference Paper
A novel high-speed Booth encoder is designed by utilizing a new truth table. The important advantage of this structure is its low delay with respect to previously presented designs. Moreover, generating the partial products and putting the partial product array in order are done at the same time. Simulation results obtained with HSPICE in TSMC 0.18 μm technology show that the total delay of the proposed structure is about 170 ps.
Article
Pseudo-achiral metal-centre driven spontaneous resolution occurred simultaneously in the formation of two Δ- and Λ-isomers of [CdBa(OBA)2(DMF)(CH3OH)(H2O)]·H2O (H2OBA = 4,4′-oxybis(benzoic acid)), which illustrated a clear relationship between chirality and helicity: the absolute sense of a double-helix made of achiral components is induced by metal centres in the two enantiomeric forms.
Conference Paper
A high-performance 64-bit Multiplier-and-Accumulator (MAC) design is implemented in this paper. The MAC unit performs an important operation in many digital signal processing (DSP) applications. The multiplier is designed using a modified Wallace multiplier and the adder is a carry save adder. The whole design is coded in Verilog HDL and synthesized using Cadence RTL Compiler with typical libraries of TSMC 0.18 μm technology. The complete MAC unit operates at 217 MHz. The total power dissipation is 177.732 mW.
Article
In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. Two's complement multipliers are important for a wide range of applications. The proposed method is general and can be extended to higher radix encodings, as well as to m × n multiplications of any size. This can be achieved with the extra hardware of a (short) 3-bit addition and the simpler generation of the first partial product row. Implementation is done using Xilinx tools for synthesis and ModelSim for simulation in HDL.
Article
The multiplier is a vital part of the design of microprocessors, graphical systems, multimedia systems, DSP systems, etc. It is very important to have a multiplier design that is efficient in terms of performance, area and speed, and Booth's multiplication algorithm provides a fundamental platform for the advances made in high-end multipliers meant for faster multiplication with higher performance. The algorithm provides an efficient encoding of the bits during the first steps of the multiplication process. Radix-4 Booth encoding has increased the performance of the multiplier by reducing the number of partial products generated. The radix-4 Booth algorithm produces both positive and negative partial products, and implementing the negative partial products nullifies, to some extent if not fully, the advances made in other units. Most research work focuses on reducing the number of partial products generated and on implementing the algorithm efficiently. Very little work has been done on the disposal of the negative partial products generated. The work presented in the paper addresses the disposal of the negative partial products efficiently by computing the 2's complement while avoiding the additional adder for adding 1, and hence the generation of a long carry chain. The proposed mechanism also continues to support the reduction of partial products, further reducing their number from the n/2 + 1 achieved via the modified Booth algorithm to n/2. In addition, in implementing the proposed mechanism in Verilog HDL, a mode selection capability is provided, enabling the same hardware to act as a multiplier and as a simple two's complement calculator. The proposed technique has the added advantage of being independent of the number of bits to be multiplied. It is tested and verified with varied test vectors of different bit widths. The Xilinx synthesis tool is used for synthesis, and the multiplier mechanism has a maximum operating frequency of 14.59 MHz and a delay of 7.013 ns.
Conference Paper
In embedded control applications, control rate and energy consumption are two critical design issues. This paper presents a series of high-speed and low-power finite-word-length PID controllers based on a new recursive multiplication algorithm. Compared to published results under the same conditions, savings of 431% and 20% are obtained in terms of control rate and dynamic power consumption, respectively. In addition, the new multiplication algorithm generates scalable PID structures that can be tailored to the desired performance and power budget. All PIDs are implemented at RTL level as technology-independent reusable IP cores. They are reconfigurable according to two compile-time constants: set-point word length and latency.
Conference Paper
This paper proposes an innovative algorithm to find the two's complement of a binary number. The proposed method works in logarithmic time (O(log N)) instead of the worst case linear time (O(N)) where a carry has to ripple all the way from LSB to MSB. The proposed method also allows for more regularly structured logic units which can be easily modularized and can be naturally extended to any word size. Our synthesis results show that our method achieves up to 2.8× of performance improvement and up to 7.27× of power savings compared to the conventional method.
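The abstract does not spell out its algorithm, so the sketch below shows only one well-known way to avoid the ripple carry, offered as an assumption and not as the authors' method. It copies every bit up to and including the lowest 1 and inverts every bit above it. The "a lower bit was 1" signal is a prefix OR, which a parallel-prefix network can evaluate in about log2(N) levels. The function name `twos_complement_prefix` is hypothetical.

```python
def twos_complement_prefix(x: int, n: int) -> int:
    """Two's complement of an n-bit value via a log-depth prefix OR."""
    mask = (1 << n) - 1
    x &= mask
    # seen bit i starts as x[i-1]; after the loop it is the OR of x[0..i-1],
    # i.e. "some bit strictly below position i is 1".
    seen = (x << 1) & mask
    shift = 1
    while shift < n:                       # about log2(n) doubling steps
        seen |= (seen << shift) & mask
        shift <<= 1
    # Bits with a 1 somewhere below them are inverted; the others are copied.
    return x ^ seen


print(twos_complement_prefix(6, 4))  # 10, i.e. -6 modulo 16
```

Each loop iteration corresponds to one level of the prefix network, so the software loop count tracks the logic depth a hardware version of this formulation would have.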
Conference Paper
In our latest approach to datapath synthesis from RTL, datapaths are extracted into largest possible sum-of-product (SOP) blocks, thus making extensive use of carry-save intermediate results and reducing the number of expensive carry-propagations to a minimum. The sum-of-product blocks are then implemented by constraint- and technology-driven generation of partial products, carry-save adder tree and carry-propagate adder. A smart generation feature selects the best among alternative implementation variants. Special datapath library cells are used where available and beneficial. All these measures translate into better performing circuits for simple and complex datapaths in cell-based design.
Article
This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2's complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494 GOPS/W (measured at 300 mV, 50°C) with a dense layout occupying 0.081 mm² while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 μW; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.
Conference Paper
The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle lead to important scientific advances and provide a powerful new tool for drug discovery. A wide range of biologically important processes, however, occur over time scales on the order of a millisecond, several orders of magnitude beyond the duration of the longest previous MD simulations. Our research group has completed a specialized, massively parallel machine called Anton, which is capable of calculating millisecond-scale molecular trajectories at an atomic level of detail. The machine has greatly extended the power of simulation as a tool for understanding the structure and dynamics of proteins, and has already allowed us to observe and analyze important biological phenomena that have not previously been accessible to either computational or experimental study.
Conference Paper
The performance of multiplication is crucial for multimedia applications such as 3D graphics and signal processing systems which depend on extensive numbers of multiplications. Previously reported multiplication algorithms mainly focus on rapidly reducing the partial products rows down to final sums and carries used for the final accumulation. These techniques mostly rely on circuit optimization and minimization of the critical paths. In this paper, an algorithm to achieve fast multiplication in two's complement representation is presented. Indeed, our approach focuses on reducing the number of partial product rows. In turn, this directly influences the speed of the multiplication, even before applying partial products reduction techniques. Fewer partial products rows are produced, thereby lowering the overall operation time. This results in a true diamond-shape for the partial product tree which is more efficient in terms of implementation.
Conference Paper
The AltiVec™ technology is an extension to the PowerPC architecture™ which provides new computational and storage operations for handling vectors of various data lengths and data types. The first implementation using this technology is a low-cost, low-power processor based on the acclaimed PowerPC 750™ microprocessor. This paper describes the microarchitecture and design of the vector arithmetic unit of this implementation.
Conference Paper
The multiplier of an S/390 CMOS microprocessor is described. It is implemented in an aggressive static CMOS technology with a 0.20-μm effective channel length. The multiplier has been demonstrated in a single-image shared-memory multiprocessor at frequencies up to 400 MHz. The multiplier requires three machine cycles for a total latency of 7.5 ns, though the design can support a latency of 4.0 ns if the latches are removed. The design goal was to implement a versatile S/390 multiplier with reasonable performance at a very aggressive cycle time. The multiplier implements a radix-8 Booth algorithm and is capable of supporting S/390 floating-point and fixed-point multiplications, and also divisions and square roots. Logic design and physical design issues are discussed relating to the Booth decoding and counter tree implementations.
Conference Paper
A novel design technique for the construction of a decrement/increment and two's complement (DIT) circuit is presented. The technique is shown to be highly efficient in terms of both silicon area consumption and time. More interestingly, it is shown that the operation delay is almost independent of the word size, and hence the method is best used for high-density codes. Structurally, the circuit is made of two parallel paths: one for the input data and one for the generation of the control signal to be utilized for DIT operation through the data path. The circuit is designed and simulated for 64-bit word length using CMOS technology. For the worst-case situation, a 14.7 ns response time is reported.
Article
It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1 μsec, and quotients in 3 μsec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.
Article
Methods of obtaining high speed in addition, multiplication, and division in parallel binary computers are described and then compared with each other as to efficiency of operation and cost. The transit time of a logical unit is used as a time base in comparing the operating speeds of different methods, and the number of individual logical units required is used in the comparison of costs. The methods described are logical and mathematical, and may be used with various types of circuits. The viewpoint is primarily that of the systems designer, and examples are included wherever doing so clarifies the application of any of these methods to a computer. Specific circuit types are assumed in the examples.
Article
The performance of multiplication is crucial for multimedia applications such as 3D graphics and signal processing systems, which depend on the execution of large numbers of multiplications. Previously reported algorithms mainly focused on rapidly reducing the partial products rows down to final sums and carries used for the final accumulation. These techniques mostly rely on circuit optimization and minimization of the critical paths. In this paper, an algorithm to achieve fast multiplication in two's complement representation is presented. Rather than focusing on reducing the partial products rows down to final sums and carries, our approach strives to generate fewer partial products rows. In turn, this influences the speed of the multiplication, even before applying partial products reduction techniques. Fewer partial products rows are produced, thereby lowering the overall operation time. In addition to the speed improvement, our algorithm results in a true diamond-shape for the partial product tree, which is more efficient in terms of implementation. The synthesis results of our multiplication algorithm using the Artisan TSMC 0.13um 1.2-Volt standard-cell library show 13 percent improvement in speed and 14 percent improvement in power savings for 8-bit times 8-bit multiplications (10 percent and 3 percent, respectively, for 16-bit times 16-bit multiplications) when compared to conventional multiplication algorithms.
Article
We present a high-performance low-power design of linear array multipliers based on a combination of the following techniques: signal flow optimization in [3:2] adder array for partial product reduction, left-to-right leapfrog (LRLF) signal flow, and splitting of the reduction array into upper/lower parts. The resulting upper/lower LRLF (ULLRLF) multiplier is compared with tree multipliers. From automatic layout experiments, we find that ULLRLF multipliers have similar power, delay, and area as tree multipliers for $n \le 32$. With more regularity and inherently shorter interconnects, the ULLRLF structure presents a competitive alternative to tree structures in the design of fast low-power multipliers implemented in deep submicron VLSI technology.
Article
This paper presents a design methodology for high-speed Booth encoded parallel multiplier. For partial product generation, we propose a new modified Booth encoding (MBE) scheme to improve the performance of traditional MBE schemes. For final addition, a new algorithm is developed to construct multiple-level conditional-sum adder (MLCSMA). The proposed algorithm can optimize final adder according to the given cell properties and input delay profile. Compared with a binary tree-based conditional-sum adder, the speed performance improvement is up to 25 percent. On average, the design developed herein reduces the total delay by 8 percent for parallel multiplier. The whole design has been verified by gate level simulation
Article
We present new design and analysis techniques for the synthesis of parallel multiplier circuits that have smaller predicted delay than the best current multipliers. V.G. Oklobdzija et al. (1996) suggested a new approach, the Three-Dimensional Method (TDM), for Partial Product Reduction Tree (PPRT) design that produces multipliers that outperform the current best designs. The goal of TDM is to produce a minimum delay PPRT using full adders. This is done by carefully modeling the relationship of the output delays to the input delays in an adder and, then, interconnecting the adders in a globally optimal way. Oklobdzija et al. suggested a good heuristic for finding the optimal PPRT, but no proofs about the performance of this heuristic were given. We provide a formal characterization of optimal PPRT circuits and prove a number of properties about them. For the problem of summing a set of input bits within the minimum delay, we present an algorithm that produces a minimum delay circuit in time linear in the size of the inputs. Our techniques allow us to prove tight lower bounds on multiplier circuit delays. These results are combined to create a program that finds optimal TDM multiplier designs. Using this program, we can show that, while the heuristic used by Oklobdzija et al. does not always find the optimal TDM circuit, it performs very well in terms of overall PPRT circuit delay. However, our search algorithms find better PPRT circuits for reducing the delay of the entire multiplier
Article
This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1 μ CMOS ASIC technology
Article
This paper describes a 16 × 16 bit single-cycle 2's complement multiplier with a reconfigurable PLA control block fabricated in 90-nm dual-$V_t$ CMOS technology, operating at 1 GHz, 9 mW (measured at 1.3 V, 50°C). Optimally tiled compressor tree architecture with radix-4 Booth encoding, arrival-profile aware completion adder and low clock power write-port flip-flop circuits enable a dense layout occupying 0.03 mm² while simultaneously achieving: 1) low compressor tree fan-outs and wiring complexity; 2) low active leakage power of 540 μW and high noise tolerance with all high-$V_t$ usage; 3) ultra low standby-mode power of 75 μW and fast wake-up time of <1 cycle using PMOS sleep transistors; 4) scalable multiplier performance up to 1.5 GHz, 32 mW measured at 1.95 V, 50°C; and 5) low-voltage mode multiplier performance of 50 MHz, 79 μW measured at 570 mV, 50°C.
Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array
  • F Lamberti
  • N Andrikos
  • E Antelo
  • P Montuschi
F. Lamberti, N. Andrikos, E. Antelo, and P. Montuschi, "Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array," internal report, http://arith.polito.it/ir_mbe.pdf, pp. 1-14, 2009.
130nm HCMOS9 Cell Library
  • STMicroelectronics
STMicroelectronics, "130nm HCMOS9 Cell Library," http://www.st.com/stonline/products/technologies/soc/evol.htm, 2008.