Article

Reducing the Computation Time in (Short Bit-Width) Two's Complement Multipliers

March 2011
IEEE Transactions on Computers 60(2):148 - 156

DOI:10.1109/TC.2010.156

Source
IEEE Xplore

Authors:

F. Lamberti

Politecnico di Torino

Nikos Andrikos

Mentor Graphics

Elisardo Antelo

University of Santiago de Compostela

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results, based on a rough theoretical analysis and on logic synthesis, showed its efficiency in terms of both area and delay.
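As background for the abstract above, the following minimal sketch shows standard radix-4 Modified Booth recoding, which turns an n-bit two's complement multiplier into n/2 signed digits and therefore n/2 partial products. It does not implement the row-reduction technique of the paper. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and n is assumed to be even.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's complement value y (n even) into n/2 digits in {-2, ..., 2}."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even number")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    bits = y & ((1 << n) - 1)      # raw two's complement bit pattern
    digits = []
    prev = 0                       # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        # Each overlapping triplet (y[2i+1], y[2i], y[2i-1]) yields one signed digit.
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 partial products d_i * x * 4**i (software model, not hardware)."""
    return sum((d * x) << (2 * i) for i, d in enumerate(booth_radix4_digits(y, n)))
```

For example, the 4-bit value -3 (bit pattern 1101) recodes to the digits 1 and -1, since 1 - 4 = -3. In hardware, each negative digit additionally needs a correction bit for two's complement negation, which is the source of the extra row that the paper removes.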

Low-Error Reconfigurable Fixed-Width Multiplier for Image Processing Applications

Article

Mar 2020

Compensating for the error with additional circuitry is mandatory in a low-error fixed-width multiplier. Instead of compensating for the error, this paper proposes reconfiguring an n-bit fixed-width multiplier into an n/2-bit error-free full-width multiplier using decomposed multiplication. The decomposed block multiplication, using an area-efficient New Bit Pair Recoding (NBPR) algorithm in fixed-width mode, shows a lower truncation error than existing truncated multipliers. A reconfigurable 16x16 NBPR multiplier in three different modes (8x8, 16x8, 16x16) with a fixed 16-bit product is verified on the TSMC 65nm CMOS standard cell library. The experimental results show that the NBPR multiplier consumes less area than standard Booth multipliers. Evaluating the proposed multiplier in imaging shows improved PSNR with minimal error compared to other fixed-width multipliers.

A Modified Partial Product Generator for Redundant Binary Multipliers

Article

Jan 2015

Due to its high modularity and carry-free addition, a redundant binary (RB) representation can be used when designing high performance multipliers. The conventional RB multiplier requires an additional RB partial product (RBPP) row, because an error-correcting word (ECW) is generated by both the radix-4 Modified Booth encoding (MBE) and the RB encoding. This incurs an additional RBPP accumulation stage for the MBE multiplier. In this paper, a new RB modified partial product generator (RBMPPG) is proposed; it removes the extra ECW and hence saves one RBPP accumulation stage. Therefore, the proposed RBMPPG generates fewer partial product rows than a conventional RB MBE multiplier. Simulation results show that the proposed RBMPPG based designs significantly improve the area and power consumption when the word length of each operand in the multiplier is at least 32 bits; these reductions over previous NB multiplier designs incur a modest delay increase (approximately 5 percent). The power-delay product can be reduced by up to 59 percent using the proposed RB multipliers when compared with existing RB multipliers.

An optimised twin precision multiplier for ASIC environment

Article

Dec 2015

In this paper, we present the performance of the twin precision technique in a reduced computation modified Booth (RCMB) multiplier to achieve double throughput, and we propose an algorithm. The twin precision technique is an efficient way to obtain double throughput in multipliers. We describe how to apply the twin precision technique to RCMB multipliers. Implementing twin precision in an RCMB multiplier requires fewer changes to the partial product array to obtain double throughput. Multiplexers usually perform the signal selection for N-bit and N/2-bit multiplication. In the RCMB multiplier, [N/2] + 1 partial product rows are reduced to N/2 rows. Applying the twin precision technique to RCMB uses only about [N/2] + 3 multiplexers, which opens the way for optimization of the twin precision (TP) multiplier. We thereby achieve a reduction in multiplexer utilisation of about 40% to 50% (for N = 8 to 128) compared to the existing twin precision modified Booth multiplier. In the proposed optimised TP modified Booth multiplier, this reduction in multiplexers leads to an overall reduction in area, power and delay: for N = 8 to 128, area is reduced by about 5% to 18%, delay by 5% to 20%, and power by 8% to 32%. The proposed optimised TP multiplier is implemented in FFT complex multiplication, taken as an application case study, and achieves better performance (area, delay and power) compared to the prior TP multiplier. All evaluations are made using Cadence RTL Compiler with a TSMC 180 nm library.

A New Recursive Multibit Recoding Algorithm for High-Speed and Low-Power Multiplier

Article

Dec 2012
J Low Power Electron

In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (n/r), but also eliminates the need of pre-computing odd multiples of the multiplicand in higher radix (β ≥ 8) multiplication. Based on a mathematical proof that any higher radix β = 2^r can be recursively derived from a combination of two or a number of lower radices, a series of generalized radix β = 2^r multipliers are generated by means of primary radices: 2^1, 2^2, 2^5, and 2^8. A variety of higher-radix (2^3 to 2^32) two's complement 64×64 bit serial/parallel multipliers are implemented on Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r value varying from 2 to 64. Compared to a reference algorithm, improvements and savings of 8%, 35%, and 39% are respectively obtained in terms of speed, power, and area. In addition, a new low-power and highly flexible radix-2^r adapted technique for multi-precision multiplication is presented.

Design and Verification of High Speed Multiplier

Article

Nov 2013

PARTIAL PRODUCT ARRAY HEIGHT REDUCTION USING RADIX-16 FOR 64-BIT BOOTH MULTIPLIER

Article

Jun 2019

Novel Booth Encoder and Decoder for Parallel Multiplier Design

Article

Jan 2013

Mamtha Prajapati

New High-Speed and Low-Power Radix-2^r Multiplication Algorithms

Conference Paper

Jun 2012

In this paper, a new recursive multibit recoding multiplication algorithm is introduced. It provides a general space-time partitioning of the multiplication problem that not only enables a drastic reduction of the number of partial products (N/r), but also eliminates the need of pre-computing odd multiples of the multiplicand in higher radix (r ≥ 3) multiplication. Based on a mathematical proof that any higher radix-2^r can be recursively derived from a combination of two or a number of lower radices, a series of generalized radix-2^r multipliers are generated by means of primary radices: 2^1, 2^2, 2^5, and 2^8. A variety of higher-radix (2^3 to 2^32) two's complement 64×64 bit serial/parallel multipliers are implemented on Virtex-6 FPGA and characterized in terms of multiply-time, energy consumption per multiply-operation, and area occupation for r value varying from 2 to 64. Compared to a recently published algorithm, savings of 21%, 53%, and 105% are respectively obtained in terms of speed, power, and area.

Asic Implementation of 12-Bit Radix-8 Booth Multiplier

Article

Jul 2019

Multipliers play a vital role in DSP and neural network applications. Many methods have been introduced to build multipliers that offer high speed, low power consumption and reduced area. The Booth algorithm provides an efficient way of performing signed binary multiplication. In this paper, the physical design of a 12-bit radix-8 Booth multiplier for signed multiplication is presented with the aim of improving performance metrics such as power, area and delay. The performance of the 12-bit radix-8 Booth multiplier is compared with that of a 64-bit radix-16 Booth multiplier.

Truncated Multiplier with Delay-Minimized Exact Radix-8 Booth Recoder Using Carry Resist Adder

Article

Apr 2021
CIRC SYST SIGNAL PR

The delay owing to the generation of odd multiples (±3) in Radix-8 Booth recoding is minimized in this paper using a carry resist adder (CRA). The CRA is developed specifically to perform the exact addition of ±1 and ±2 without carry propagation. The theoretical delay analysis shows that the 8-bit CRA reduces delay by 86.26% when compared to conventional Carry Propagate Addition (CPA) methods. Subsequently, relative comparisons of the CRA with various approximation-based recoders show that the CRA has lower area, power and critical path delay. Further, 8×8 and 16×16 signed binary multipliers using the CRA-based Radix-8 Booth recoder are developed and synthesized on the TSMC 65nm CMOS standard cell library. Also, the trade-off between area, power, delay and accuracy is verified for the proposed design using truncation. Finally, the CRA-based truncated Radix-8 Booth 8×8 multiplier is applied to color space conversion to quantify its suitability for imaging. The PSNR and MSE are used to evaluate the quality of the resultant image and show better performance than other existing approximated as well as truncated Radix-8 Booth multipliers.

FPGA Implementation of 16-bit Multipliers based upon Vedic Mathematic Approach

Article

Jan 2014

Zulhelmi Zulhelmi

This paper proposes the design and implementation of a 16-bit multiplier based upon a Vedic mathematics approach, where the design has been targeted to a Xilinx Field Programmable Gate Array (FPGA) board, device XC5VLX30. The approach is different from a number of approaches that have been used to realize multipliers. It has been reported that previous algorithms such as Booth, Modified Booth, and Carry Save Multipliers are only suitable for improving speed or decreasing area utilization; therefore, those algorithms are not appropriate for designing multipliers that are used for digital signal processing (DSP) applications. Moreover, they are not flexible enough to be implemented on FPGAs or on a single chip using application specific integrated circuits (ASICs). The Vedic approach, on the other hand, can be used to design multipliers with optimum speed and less area utilization. In addition, it can be reliably implemented on FPGAs or on a single chip. Behavioral and post-route simulation results show that the proposed multiplier performs better in terms of speed than the other reported multipliers when implemented on the FPGA. In terms of area utilization, better results are also obtained.

Design of High-Speed and Low-Power Finite-Word-Length PID Controllers

Article

Apr 2014
J Contr Theor Appl

ASIC or FPGA implementation of a finite word-length PID controller requires a double expertise: in control systems and in hardware design. In this paper, we focus only on the hardware side of the problem. We show how to design configurable fixed-point PIDs to satisfy applications requiring minimal power consumption, a high control rate, or both together. As the multiply operation is the engine of a PID, we experimented with three algorithms: Booth, modified Booth, and a new recursive multi-bit multiplication algorithm. The latter enables the construction of finely grained PID structures with bit-level and unit-time precision. Such a feature makes it possible to tailor the PID to the desired performance and power budget. All PIDs are implemented at register-transfer level (RTL) as technology-independent reusable IP-cores. They are reconfigurable according to two compile-time constants: set-point word-length and latency. To make the PID design easily reproducible, all necessary implementation details are provided and discussed.

A New High Radix-2^r (r ≥ 8) Multibit Recoding Algorithm for Large Operand Size (N ≥ 32) Multipliers

Article

Apr 2013
J Low Power Electron

This paper addresses the problem of multiplication with large operand sizes (N ≥ 32). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and also reduces the hardware complexity of the partial-product generators. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands. As a result, the critical path is drastically reduced to $3\sqrt[3]{N/2}-3$ adder stages with no area overhead, in comparison to the modified Booth algorithm, which shows a critical path of N/2 adder stages. For instance, only 7 adder stages are needed for a 64-bit two's complement multiplier. Compared with reference algorithms for N = 64, significant gain ratios of 1.62, 1.71, and 2.64 are obtained in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.

High-Speed and Low-Power PID Structures for Embedded Applications.

Article

Sep 2011
Lect Notes Comput Sci

In embedded control applications, control rate and energy consumption are two critical design issues. This paper presents a series of high-speed and low-power finite-word-length PID controllers based on a new recursive multiplication algorithm. Compared to published results under the same conditions, savings of 431% and 20% are respectively obtained in terms of control rate and dynamic power consumption. In addition, the new multiplication algorithm generates scalable PID structures that can be tailored to the desired performance and power budget. All PIDs are implemented at RTL level as technology-independent reusable IP-cores. They are reconfigurable according to two compile-time constants: set-point word-length and latency.

Exact and approximate multiplications for signal processing applications

Article

Feb 2023

A Performance Comparison Review of Multiplier Designs

Conference Paper

Dec 2022

Design and Hardware Realization of 32-Bit Multipliers Based on FPGAs

Conference Paper

Sep 2022

Delay and Power Analysis of Modified Booth Multiplier

Conference Paper

Apr 2022

Low-Power Low-Error Fixed-Width Multiplier Design for Digital Signal Processing

Conference Paper

Jan 2021

Design and analysis of High-Speed Low-Power Vedic Multiplier with 3-1-1-2 compressor Using Reversible Logic gates

Article

Feb 2021

The FFT is one of the most important functions in digital signal processing, used in applications such as image processing, wireless communications and multimedia. FFT processors consist of butterfly structures that perform addition, subtraction, and multiplication of complex values. In this work, the FFT butterfly structure is designed with Vedic multipliers for high-speed applications. In this Vedic multiplier, the Urdhva Tiryakbhyam algorithm is used to improve efficiency by optimizing the number of logic gates, constant inputs and garbage outputs. The data computation time is reduced by a 3-1-1-2 compressor built from reversible logic gates, which reduces surplus power consumption by 11.24% and sums the partial products with about 5.28% less delay. The area, power, delay, area-delay product and power-delay product are calculated using Cadence Virtuoso, and the design is implemented on the Spartan-6 device family using Xilinx ISE.

Area and Power Efficient 64-Bit Booth Multiplier

Conference Paper

Mar 2020

Efficient modular hybrid adders and Radix-4 booth multipliers for DSP applications

Article

Feb 2020
MICROELECTRON J

Adders and multipliers are the fundamental elements of a signal processing architecture. Improving the speed of addition and multiplication operations while minimizing power consumption and area is the problem addressed in this paper. Two versions of modular hybrid adder structures are proposed. The adder structures are derived by merging improved carry skip, carry look-ahead and ripple carry adder (RCA) concepts. The proposed adder version-1 has improved speed of operation while maintaining power consumption lower than that of the RCA. The proposed adder version-2 achieves a further improvement in speed through the addition of an incrementation scheme, at the cost of a slight increase in hardware complexity. Two versions of radix-4 Booth multipliers are proposed. Of the two, Booth multiplier version-1 has the highest speed and lowest power consumption, and version-2 has the lowest area, compared to most of the existing architectures. Synthesis results show that the delay of the proposed multiplier version-1 is reduced by 20.74%, PDP by 45.62% and ADP by 32.26% in comparison with a typical low-PDP 8×8 Booth multiplier, while consuming 31.4% less power and 14.59% less area. Cadence software with the gpdk 45 nm standard cell library is used for the design and implementation.

A Novel Methodology for Multiplication of Three n-Bit Binary Numbers: Methods and Protocols

Chapter

Jan 2019

Design and Performance Analysis of Reconfigurable Modified Vedic Multiplier with 3-1-1-2 Compressor

Article

Mar 2019
MICROPROCESS MICROSY

The Fast Fourier Transform (FFT) is one of the most commonly used digital signal processing (DSP) functions, with applications such as imaging, wireless communication, and multimedia. FFT processors consist of butterfly structure operations, which include multiplication, addition and subtraction of complex-valued data. In this paper, an FFT butterfly structure is designed using the Vedic multiplier for high-speed applications. In this Vedic multiplier, the Urdhva Tiryakbhyam algorithm is used to improve efficiency. A detector block is then introduced to identify the unwanted portion of the input data to be processed in the data processing unit. Data computation time is therefore reduced in the detector-based Vedic multiplier, which supports full-range and half-range input data. The detector is based on a Boolean function that detects the valid ranges of the two input operands during input data computation. The detector result is used to select the operand with half-range input data for Vedic multiplication and to disable the surplus computation. This reduces the switching activity in the logic gates and proportionally reduces the power consumption. The proposed design-I consists of the Vedic algorithm and the detection unit. Then, the 3-1-1-2 compressor is designed and used in the multiplier. The proposed design-II is developed with a modified Vedic algorithm, the detection unit and the proposed 3-1-1-2 compressor. Finally, radix-2, radix-4, and radix-8 FFT butterflies are implemented using the detection-unit-based Vedic multiplier, the 3-1-1-2 compressor-based multiplier and various existing multipliers. The proposed design-I and design-II are implemented on Spartan-6, Virtex-4 and Virtex-5 FPGA family devices. The proposed reconfigurable Vedic multiplier is simulated and synthesized using Synopsys tools with a 90 nm standard cell library.

Abstract of the Thesis

Thesis

Jan 2014

Abdelkrim Kamel Oudjida

This thesis addresses the problem of optimal hardware realization of finite-word-length (FWL) linear controllers dedicated to MEMS applications. The biggest challenge is to ensure satisfactory control performance with minimal hardware. To meet this challenge, two distinct but complementary optimizations can be undertaken: in control theory and in binary arithmetic. Only the latter is involved in this work. Because MEMS applications are targeted, the binary arithmetic must be fast enough to cope with the rapid dynamics of MEMS; power-efficient for embedded control; highly scalable for easy adjustment of the control performance; and easily predictable, to provide a precise idea of the required logic resources before implementation. The exploration of a number of binary arithmetics showed that radix-2^r is the best candidate for the aforementioned requirements. It has been fully exploited to design efficient multiplier cores, which are the real engine of linear systems. The radix-2^r arithmetic was applied to the hardware integration of two FWL structures: a linear time-variant PID controller and a linear time-invariant LQG controller with a Kalman filter. Both controllers showed a clear superiority over their existing counterparts, or in comparison to their initial forms.

Power Efficient MAC Unit Based Digital PID Controllers

Article

Nov 2016

Proper closed-loop control has long been a hot issue in the automotive industry. Industrial equipment governed by PID controllers has a very simple control architecture and good efficiency, but it still suffers from large power consumption and slow mathematical computation. Many researchers have worked, and are still working, on designing a low-power, low-delay PID. This paper reviews three MAC architectures, with array, Booth and Wallace tree multipliers, incorporated in a PID architecture. The simulations are done and the area, power, and delay results are synthesized using Xilinx ISE. Comparisons are made between these three architectures in terms of power-delay product and area-delay product.

Power-delay-area efficient design of vedic multiplier using adaptable manchester carry chain adder

Conference Paper

Apr 2017

A Delay Efficient Vedic Multiplier

Article

Feb 2018

Vedic mathematics is the ancient Indian method of mathematics based on 16 Sutras applicable to various branches of mathematics like trigonometry, calculus, geometry, conics etc. Multiplication is effectively used in modern communication and Digital Signal Processing applications. Ordinary multiplication requires propagation of carry from LSB to MSB while adding binary partial products, which limits the overall speed of multiplication. Vedic mathematics helps in generation of partial products and sums in one step, and ensures reduction in overall propagation delay. Urdhva Tiryakbhyam Sutra and Nikhilam Sutra are the two multiplication techniques used in Vedic mathematics. In this paper, an 8 * 8 Nikhilam Sutra multiplier for three different sets of bases is realized. The concepts of Urdhva Tiryakbhyam Sutra multiplication are used for the implementation of the proposed multiplier. The implementation results are compared with that of a Modified Booth’s multiplier in terms of delay, area and power. The design is synthesized in Synopsys Design Compiler using CMOS 90 nm technology, and results show that the proposed multiplier using Nikhilam Sutra with 25 bases is faster than the Modified Booth’s multiplier by 51.28%.
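The Urdhva Tiryakbhyam ("vertically and crosswise") idea mentioned above can be illustrated with a minimal software sketch: every output column is formed from the cross products of bit pairs whose positions add up to that column, plus the carry from the previous column. This is only a behavioural model for unsigned operands under that assumption, not the hardware design evaluated in the paper; the function name `urdhva_tiryakbhyam` is hypothetical.

```python
def urdhva_tiryakbhyam(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers column by column (vertically and crosswise)."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result = 0
    carry = 0
    for k in range(2 * n - 1):
        # Crosswise products a[i] * b[k - i] all have weight 2**k.
        lo = max(0, k - n + 1)
        hi = min(k, n - 1)
        column = carry + sum(a_bits[i] & b_bits[k - i] for i in range(lo, hi + 1))
        result |= (column & 1) << k   # keep the column's low bit
        carry = column >> 1           # pass the rest to the next column
    return result | (carry << (2 * n - 1))
```

In hardware, all cross products of a column are available in parallel, which is why the method is described as generating partial products and sums in one step; the loop here only models the arithmetic.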

An ASIC design of an optimized multiplication using twin precision

Conference Paper

Jun 2017

Improved 64-bit Radix-16 Booth Multiplier Based on Partial Product Array Height Reduction

Article

Dec 2016

In this paper, we describe an optimization for binary radix-16 (modified) Booth recoded multipliers to reduce the maximum height of the partial product columns to $\lceil n/4\rceil$ for $n=\text{64-bit}$ unsigned operands. This is in contrast to the conventional maximum height of $\lceil(n+1)/4\rceil$ . Therefore, a reduction of one unit in the maximum height is achieved. This reduction may add flexibility during the design of the pipelined multiplier to meet the design goals, it may allow further optimizations of the partial product array reduction stage in terms of area/delay/power and/or may allow additional addends to be included in the partial product array without increasing the delay. The method can be extended to Booth recoded radix-8 multipliers, signed multipliers, combined signed/unsigned multipliers, and other values of $n$ .

A fastest multiplier using two's compliment method

Conference Paper

Mar 2016

This paper focuses on two's complement multipliers with short bit-widths, whose computation time is reduced without any increase in the delay of the partial product generation stage. This is done by reducing by one row the maximum height of the partial product array generated by a radix-4 Modified Booth multiplier; this reduction may allow for a faster compression of the partial product array and a regular layout. The method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers.

High performance redundant binary multiplier

Conference Paper

Apr 2016

High-speed multiplier designs have been a priority for multiplier-dominated applications such as wireless communications, computer applications, and image processing. In this paper, a high-performance fixed-word-length multiplier design is proposed that uses a recently proposed technique to eliminate the error-correcting word and a delay-efficient parallel prefix Ling adder for the final redundant binary to normal binary (RB-NB) conversion. These techniques are selected to achieve a trade-off between area, power and delay. Due to its carry-free addition and adaptability, the redundant binary (RB) representation has been chosen for the partial product summing tree of our high-performance multiplier design. This multiplier architecture is compared with the conventional redundant binary modified Booth encoding multiplier (CRBMBE) in terms of area, power and delay. The designed architecture shows improved performance over the conventional redundant binary multiplier in terms of area, delay and power-delay product (PDP).

VLSI design of 64bit × 64bit high performance multiplier with redundant binary encoding

Conference Paper

Sep 2016

For multiplier-dominated applications such as digital signal processing, wireless communications, and computer applications, high-speed multiplier designs have always been a primary requisite. In this paper, a high-performance 64×64 bit redundant binary (RB) multiplier has been designed using a recently proposed redundant binary encoding approach to eliminate the error-correcting word and a delay-efficient parallel prefix Ling adder for the final redundant binary to normal binary (RB-NB) conversion. Since the redundant binary (RB) representation allows carry-free addition and adaptability, it has been used in the 64×64 bit high-performance RB multiplier design for the summation of partial product terms. Eliminating the error-correcting word also removes a redundant partial product accumulation stage, which improves the complexity and the critical path delay. The performance of the RB multiplier design is compared with that of the conventional RB modified Booth encoding multiplier (CRBMBE). The comparison is based on synthesis results, in terms of area and delay, obtained by synthesizing both multiplier architectures for a Xilinx FPGA.

Accurate Hardware-Efficient Logarithm Circuit

Article

Sep 2016

This brief presents a hardware-efficient logarithm circuit design based on a novel discontinuous piecewise linear approximation method. Hardware synthesis results targeted for a commercial application specific integrated circuit cell library and field-programmable gate array show the practicality of the proposed design. A new figure of merit that combines error, area, time, and power is introduced and used to show that the proposed method provides the designer with useful design options when implementing logarithmic conversion.

Design and Comparison of Regularize Modified Booth Multiplier Using Different Adders

Conference Paper

Dec 2013

The conventional Modified Booth Encoding (MBE) generates n/2+1 rows instead of n/2 rows, and also an irregular partial product (PP) array, because of the extra neg bit (sign bit) at the least significant bit (LSB) position of each partial product row. In this paper, a simple approach is proposed to generate n/2 partial product rows along with a regular partial product array, thereby reducing the area and power of MBE multipliers [2]. Here, a technique for computing the 2's complement directly is applied to the last partial product row in order to reduce the number of partial product rows to n/2. The partial products are regularized by adding the LSB of each partial product row to its neg bit. In addition, different adders are used to generate the final result and are compared. A ripple carry adder, a carry-lookahead adder and a carry select adder are used in the third and final step. The carry select adder shows a significant improvement in delay compared to the carry-lookahead adder and the ripple carry adder.
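The origin of the extra row can be made concrete with a small behavioural sketch. It assumes the common arrangement in which a negative Booth digit is realised as a bitwise inversion of the selected multiple plus a separate neg bit at the row's LSB column; the names `mbe_rows` and `product_from_rows` are hypothetical, and this models conventional MBE only, not the regularization proposed in the paper.

```python
def mbe_rows(x: int, y: int, n: int):
    """Return (rows, neg_bits) for n-bit signed x, y (n even), as 2n-bit words."""
    mask = (1 << (2 * n)) - 1
    y_bits = y & ((1 << n) - 1)
    rows, neg_bits = [], []
    prev = 0
    for i in range(0, n, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digit = -2 * b1 + b0 + prev          # Booth digit in {-2, ..., 2}
        prev = b1
        magnitude = abs(digit) * x           # 0, x or 2x
        neg = int(digit < 0)
        # Negative digits use one's complement; the missing +1 is the neg bit.
        word = ~magnitude if neg else magnitude
        rows.append((word << i) & mask)
        neg_bits.append(neg << i)
    return rows, neg_bits


def product_from_rows(rows, neg_bits, n: int) -> int:
    """Add all rows and neg bits modulo 2**(2n) and reinterpret as a signed value."""
    total = (sum(rows) + sum(neg_bits)) & ((1 << (2 * n)) - 1)
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total
```

There are n/2 rows and n/2 neg bits. The neg bit of row i sits at column 2i, which is free in row i+1 (that row starts at column 2i+2), so every neg bit except the last can share a row with the next partial product. The last one has no following row, which gives the n/2+1 rows mentioned above.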

Design of efficient multiply-accumulate block for PID controllers

Article

Jun 2015

Proper closed-loop control has long been a pressing issue in many automotive industries. Industrial equipment governed by PID controllers has a simple control structure and good efficiency, but it still suffers from large power consumption and slow mathematical computation. Many researchers have tried, and are still trying, to design a low-power, low-delay PID. This paper reviews three MAC architectures, with array, Booth and Wallace tree multipliers, which are in turn incorporated in a PID architecture. The simulations are done in ModelSim and the power results are synthesized using Xilinx ISE. The results suggest that the Wallace tree based MAC unit consumes less power and area.

An ultra high speed booth encoder structure for fast arithmetic operations

Conference Paper

Jun 2015

A novel high-speed Booth encoder is designed using a new truth table. The main advantage of this structure is its low delay compared with previously reported designs. Moreover, generating the partial products and ordering the partial product array are done at the same time. HSPICE simulation results in TSMC 0.18 μm technology show that the total delay of the proposed structure is about 170 ps.

Absolute helicity induction: Chiral information transfer from metal centre to the framework

Article

Feb 2014
CRYSTENGCOMM

Pseudo-achiral metal-centre driven spontaneous resolution occurred simultaneously in the formation of two Δ- and Λ-isomers of [CdBa(OBA)2(DMF)(CH3OH)(H2O)]·H2O (H2OBA = 4,4′-oxybis(benzoic acid)), which illustrated a clear relationship between chirality and helicity: the absolute sense of a double-helix made of achiral components is induced by metal centres in the two enantiomeric forms.

Design of high performance 64 bit MAC unit

Conference Paper

Mar 2013

A high-performance 64-bit Multiplier-and-Accumulator (MAC) design is implemented in this paper. The MAC unit performs an important operation in many digital signal processing (DSP) applications. The multiplier is designed using a modified Wallace multiplier, and the adder is a carry save adder. The complete design is coded in Verilog HDL and synthesized using Cadence RTL Compiler with typical libraries of TSMC 0.18 μm technology. The complete MAC unit operates at 217 MHz. The total power dissipation is 177.732 mW.

Implementation of Radix-4 in 2’s Complement Modified Booth Encoded Multiplier

Article

Sep 2013

David Solomon Raju

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to m × n multiplications of any size. The reduction is achieved with the extra hardware of a (short) 3-bit addition and a simpler generation of the first partial product row. Implementation is done in HDL, using Xilinx tools for synthesis and ModelSim for simulation.

Disposition (reduction) of (negative) partial product for Radix 4 Booth's Algorithm

Article

Dec 2011

The multiplier is a vital part of microprocessors, graphics systems, multimedia systems, DSP systems, and similar designs. An efficient multiplier design in terms of performance, area, and speed is therefore very important, and Booth's multiplication algorithm provides a fundamental platform for the advances made in high-end multipliers meant for faster multiplication. The algorithm provides an efficient encoding of the bits during the first steps of the multiplication process. Radix-4 Booth encoding has further increased multiplier performance by reducing the number of partial products generated. The radix-4 Booth algorithm produces both positive and negative partial products, and implementing the negative partial products nullifies, to some extent if not fully, the advances made in other units. Most research focuses on reducing the number of partial products generated and on implementing the algorithm efficiently; very little work has been done on the disposal of the negative partial products. The work presented in this paper addresses the efficient disposal of the negative partial products by computing the 2's complement without the additional adder for adding 1, and hence without generating a long carry chain. The proposed mechanism also continues to support the reduction of partial products: it improves on the n/2 + 1 partial products achieved via the modified Booth algorithm, reducing them to n/2. In the Verilog HDL implementation of the proposed mechanism, a mode selection capability is provided, enabling the same hardware to act either as a multiplier or as a simple two's complement calculator. The proposed technique has the added advantage of being independent of the number of bits to be multiplied. It is tested and verified with varied test vectors of different bit widths. The Xilinx synthesis tool is used for synthesis, and the multiplier mechanism has a maximum operating frequency of 14.59 MHz and a delay of 7.013 ns.
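
To make the radix-4 Booth recoding discussed in this abstract concrete, here is a minimal Python sketch of the standard textbook recoding, not the mechanism proposed in the paper. It turns an n-bit two's complement multiplier into n/2 signed digits in {-2, -1, 0, 1, 2}, each of which selects one partial product. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical. In hardware, a negative digit is usually realized by inverting the row and adding a correction bit, and the correction bit of the last row is what typically leads to the extra (n/2 + 1)-th row mentioned above; the sketch sidesteps this by using signed integers.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement value y (n even) into n/2
    radix-4 Booth digits, least significant digit first."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    bits = y & ((1 << n) - 1)      # n-bit two's complement pattern of y
    digits = []
    prev = 0                       # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        # digit = -2*y[2i+1] + y[2i] + y[2i-1]
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, n):
    """Multiply x by the n-bit signed value y by summing n/2 partial products."""
    partial_products = [d * x * 4 ** k
                        for k, d in enumerate(booth_radix4_digits(y, n))]
    return sum(partial_products)
```

For example, `booth_radix4_digits(-3, 8)` yields `[1, -1, 0, 0]`, that is, -3 = 1 - 4, so only four partial products are needed for an 8-bit multiplier instead of eight.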

High-Speed and Low Power PID Structures for Embedded Applications

Conference Paper

Sep 2011

In embedded control applications, control rate and energy consumption are two critical design issues. This paper presents a series of high-speed and low-power finite-word-length PID controllers based on a new recursive multiplication algorithm. Compared to published results under the same conditions, savings of 431% and 20% are obtained in terms of control rate and dynamic power consumption, respectively. In addition, the new multiplication algorithm generates scalable PID structures that can be tailored to the desired performance and power budget. All PIDs are implemented at RTL level as technology-independent reusable IP cores. They are reconfigurable according to two compile-time constants: set-point word length and latency.

A Logarithmic Time Method for Two’s Complementation

Conference Paper

May 2005

This paper proposes an innovative algorithm to find the two's complement of a binary number. The proposed method works in logarithmic time (O(log N)) instead of the worst-case linear time (O(N)) where a carry has to ripple all the way from LSB to MSB. The proposed method also allows for more regularly structured logic units which can be easily modularized and can be naturally extended to any word size. Our synthesis results show that our method achieves up to 2.8× of performance improvement and up to 7.27× of power savings compared to the conventional method.

Principles of Digital Design

Article

Jan 1996

Daniel Gajski

Computer Arithmetic-Principle, Architecture and Design

Article

Jan 1979

K. Hwang

Some schemes for parallel multipliers

Article

Jan 1965

L. Dadda

Optimized Synthesis of Sum-of-Products

Conference Paper

Dec 2003

In our latest approach to datapath synthesis from RTL, datapaths are extracted into largest possible sum-of-product (SOP) blocks, thus making extensive use of carry-save intermediate results and reducing the number of expensive carry-propagations to a minimum. The sum-of-product blocks are then implemented by constraint- and technology-driven generation of partial products, carry-save adder tree and carry-propagate adder. A smart generation feature selects the best among alternative implementation variants. Special datapath library cells are used where available and beneficial. All these measures translate into better performing circuits for simple and complex datapaths in cell-based design.

A 300 mV 494GOPS/W reconfigurable dual-supply 4-way SIMD vector processing accelerator in 45 nm CMOS

Article

Feb 2010

This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2's complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494GOPS/W (measured at 300 mV, 50°C) with a dense layout occupying 0.081 mm² while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 μW; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.

Anton: A Specialized Machine for Millisecond-Scale Molecular Dynamics Simulations of Proteins

Conference Paper

Jun 2009

David E. Shaw

The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle lead to important scientific advances and provide a powerful new tool for drug discovery. A wide range of biologically important processes, however, occur over time scales on the order of a millisecond, several orders of magnitude beyond the duration of the longest previous MD simulations. Our research group has completed a specialized, massively parallel machine called Anton, which is capable of calculating millisecond-scale molecular trajectories at an atomic level of detail. The machine has greatly extended the power of simulation as a tool for understanding the structure and dynamics of proteins, and has already allowed us to observe and analyze important biological phenomena that have not previously been accessible to either computational or experimental study.

A Fast and Well-Structured Multiplier

Conference Paper

Jan 2004

The performance of multiplication is crucial for multimedia applications such as 3D graphics and signal processing systems which depend on extensive numbers of multiplications. Previously reported multiplication algorithms mainly focus on rapidly reducing the partial products rows down to final sums and carries used for the final accumulation. These techniques mostly rely on circuit optimization and minimization of the critical paths. In this paper, an algorithm to achieve fast multiplication in two's complement representation is presented. Indeed, our approach focuses on reducing the number of partial product rows. In turn, this directly influences the speed of the multiplication, even before applying partial products reduction techniques. Fewer partial products rows are produced, thereby lowering the overall operation time. This results in a true diamond-shape for the partial product tree which is more efficient in terms of implementation.

A low-power, high-speed implementation of a PowerPC™ microprocessor vector extension

Conference Paper

Feb 1999

The AltiVec™ technology is an extension to the PowerPC architecture™ which provides new computational and storage operations for handling vectors of various data lengths and data types. The first implementation using this technology is a low-cost, low-power processor based on the acclaimed PowerPC 750™ microprocessor. This paper describes the microarchitecture and design of the vector arithmetic unit of this implementation.

A radix-8 CMOS S/390 multiplier

Conference Paper

Aug 1997

The multiplier of an S/390 CMOS microprocessor is described. It is implemented in an aggressive static CMOS technology with a 0.20-μm effective channel length. The multiplier has been demonstrated in a single-image shared-memory multiprocessor at frequencies up to 400 MHz. The multiplier requires three machine cycles for a total latency of 7.5 ns, though the design can support a latency of 4.0 ns if the latches are removed. The design goal was to implement a versatile S/390 multiplier with reasonable performance at a very aggressive cycle time. The multiplier implements a radix-8 Booth algorithm and is capable of supporting S/390 floating-point and fixed-point multiplications, and also divisions and square roots. Logic design and physical design issues are discussed relating to the Booth decoding and counter tree implementations.

A new parallel technique for design of decrement/increment and two's complement circuits

Conference Paper

Jun 1991

A novel design technique for the construction of a decrement/increment and two's complement (DIT) circuit is presented. The technique is shown to be highly efficient in terms of both silicon area consumption and time. More interestingly, it is shown that the operation delay is almost independent of the word size, and hence the method is best used for high-density codes. Structurally, the circuit is made of two parallel paths: one for the input data and one for the generation of the control signal to be utilized for DIT operation through the data path. The circuit is designed and simulated for 64-bit word length using CMOS technology. For the worst-case situation, a 14.7 ns response time is reported.

A Suggestion for a Fast Multiplier

Article

Mar 1964

C. S. Wallace

It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1 μsec, and quotients in 3 μsec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.

High-Speed Arithmetic in Binary Computers

Article

Feb 1961

O.L. Macsorley

Methods of obtaining high speed in addition, multiplication, and division in parallel binary computers are described and then compared with each other as to efficiency of operation and cost. The transit time of a logical unit is used as a time base in comparing the operating speeds of different methods, and the number of individual logical units required is used in the comparison of costs. The methods described are logical and mathematical, and may be used with various types of circuits. The viewpoint is primarily that of the systems designer, and examples are included wherever doing so clarifies the application of any of these methods to a computer. Specific circuit types are assumed in the examples.

A Simple High-Speed Multiplier Design

Article

Nov 2006

The performance of multiplication is crucial for multimedia applications such as 3D graphics and signal processing systems, which depend on the execution of large numbers of multiplications. Previously reported algorithms mainly focused on rapidly reducing the partial products rows down to final sums and carries used for the final accumulation. These techniques mostly rely on circuit optimization and minimization of the critical paths. In this paper, an algorithm to achieve fast multiplication in two's complement representation is presented. Rather than focusing on reducing the partial products rows down to final sums and carries, our approach strives to generate fewer partial products rows. In turn, this influences the speed of the multiplication, even before applying partial products reduction techniques. Fewer partial products rows are produced, thereby lowering the overall operation time. In addition to the speed improvement, our algorithm results in a true diamond-shape for the partial product tree, which is more efficient in terms of implementation. The synthesis results of our multiplication algorithm using the Artisan TSMC 0.13um 1.2-Volt standard-cell library show 13 percent improvement in speed and 14 percent improvement in power savings for 8-bit times 8-bit multiplications (10 percent and 3 percent, respectively, for 16-bit times 16-bit multiplications) when compared to conventional multiplication algorithms.

High-Performance Low-Power Left-to-Right Array Multiplier Design

Article

Apr 2005

We present a high-performance low-power design of linear array multipliers based on a combination of the following techniques: signal flow optimization in [3:2] adder array for partial product reduction, left-to-right leapfrog (LRLF) signal flow, and splitting of the reduction array into upper/lower parts. The resulting upper/lower LRLF (ULLRLF) multiplier is compared with tree multipliers. From automatic layout experiments, we find that ULLRLF multipliers have similar power, delay, and area as tree multipliers for n ≤ 32. With more regularity and inherently shorter interconnects, the ULLRLF structure presents a competitive alternative to tree structures in the design of fast low-power multipliers implemented in deep submicron VLSI technology.

High-speed Booth encoded parallel multiplier design

Article

Aug 2000

This paper presents a design methodology for a high-speed Booth encoded parallel multiplier. For partial product generation, we propose a new modified Booth encoding (MBE) scheme to improve the performance of traditional MBE schemes. For final addition, a new algorithm is developed to construct a multiple-level conditional-sum adder (MLCSMA). The proposed algorithm can optimize the final adder according to the given cell properties and input delay profile. Compared with a binary tree-based conditional-sum adder, the speed performance improvement is up to 25 percent. On average, the design developed herein reduces the total delay by 8 percent for the parallel multiplier. The whole design has been verified by gate-level simulation.

Optimal circuits for parallel multipliers

Article

Apr 1998

We present new design and analysis techniques for the synthesis of parallel multiplier circuits that have smaller predicted delay than the best current multipliers. V.G. Oklobdzija et al. (1996) suggested a new approach, the Three-Dimensional Method (TDM), for Partial Product Reduction Tree (PPRT) design that produces multipliers that outperform the current best designs. The goal of TDM is to produce a minimum delay PPRT using full adders. This is done by carefully modeling the relationship of the output delays to the input delays in an adder and, then, interconnecting the adders in a globally optimal way. Oklobdzija et al. suggested a good heuristic for finding the optimal PPRT, but no proofs about the performance of this heuristic were given. We provide a formal characterization of optimal PPRT circuits and prove a number of properties about them. For the problem of summing a set of input bits within the minimum delay, we present an algorithm that produces a minimum delay circuit in time linear in the size of the inputs. Our techniques allow us to prove tight lower bounds on multiplier circuit delays. These results are combined to create a program that finds optimal TDM multiplier designs. Using this program, we can show that, while the heuristic used by Oklobdzija et al. does not always find the optimal TDM circuit, it performs very well in terms of overall PPRT circuit delay. However, our search algorithms find better PPRT circuits for reducing the delay of the entire multiplier

A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach

Article

Apr 1996

This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1 μ CMOS ASIC technology

A 110 GOPS/W 16-bit multiplier and reconfigurable PLA loop in 90-nm CMOS

Article

Feb 2006

This paper describes a 16 × 16 bit single-cycle 2's complement multiplier with a reconfigurable PLA control block fabricated in 90-nm dual-Vt CMOS technology, operating at 1 GHz, 9 mW (measured at 1.3 V, 50°C). Optimally tiled compressor tree architecture with radix-4 Booth encoding, arrival-profile aware completion adder and low clock power write-port flip-flop circuits enable a dense layout occupying 0.03 mm² while simultaneously achieving: 1) low compressor tree fan-outs and wiring complexity; 2) low active leakage power of 540 μW and high noise tolerance with all high-Vt usage; 3) ultra low standby-mode power of 75 μW and fast wake-up time of <1 cycle using PMOS sleep transistors; 4) scalable multiplier performance up to 1.5 GHz, 32 mW measured at 1.95 V, 50°C; and 5) low-voltage mode multiplier performance of 50 MHz, 79 μW measured at 570 mV, 50°C.

Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array

Jan 2009
1-14

F Lamberti
N Andrikos
E Antelo
P Montuschi

F. Lamberti, N. Andrikos, E. Antelo, and P. Montuschi, "Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array," internal report, http://arith.polito.it/ir_mbe.pdf, pp. 1-14, 2009.

130nm HCMOS9 Cell Library

Jan 2007

STMicroelectronics

STMicroelectronics, "130nm HCMOS9 Cell Library," http://www.st.com/stonline/products/technologies/soc/evol.htm, 2008.
