Fig 11 - uploaded by Trong-Thuc Hoang
The overview architecture of the proposed high-speed unsigned 32-bit multiplier design.  

Source publication
Conference Paper
Full-text available
The delay of the multiplier plays a critical role in many high-speed implementations and processors such as RISC, DSP, and image processing cores, etc. In this paper, a design of unsigned 32-bit multiplier is proposed, aiming to achieve the best timing performance with an appropriate area. The proposed architecture consists of a modified Radix-4 Bo...

Similar publications

Preprint
Full-text available
Neural Networks (NN) have been proven to be powerful tools to analyze Big Data. However, traditional CPUs cannot achieve the desired performance and/or energy efficiency for NN applications. Therefore, numerous NN accelerators have been used or designed to meet these goals. These accelerators all fall into three categories: GPGPUs, ASIC NN Accelera...

Citations

... Although the authors claim that the proposed recoding schemes yield considerable performance improvements over the most efficient recoding schemes found in the literature, they give no comparison results to support the claim. An optimization of the encoder in the Modified Booth multiplier is proposed in [9]: rather than implementing the 2's complement converter with NOT logic and an addition circuit, it moves the addition to the Wallace Tree Adder in the following step. The authors show delay gains over traditional Booth recoding, at the cost of a slight increase in FPGA resources, but report no power results. ...
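The encoder optimization described in this excerpt can be illustrated with a small word-level sketch. It is not the cited design: the function names, the use of Python integers in place of fixed-width hardware words, and the zero-extension of the unsigned operand are all assumptions made for illustration. It only shows the idea that a negative Booth partial product can be formed with NOT alone, while the "+1" terms are collected and handed to the summation stage.

```python
def booth_radix4_digits(y, n_bits):
    """Recode an unsigned n_bits multiplier into radix-4 Booth digits in {-2..2}."""
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    # Two extra zero bits above the MSB keep an unsigned operand non-negative.
    for j in range(0, n_bits + 2, 2):
        b0 = (y >> j) & 1
        b1 = (y >> (j + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def multiply_deferred_increment(x, y, n_bits=32):
    """Multiply unsigned x by y; negation uses NOT only, the +1 is deferred."""
    rows = []         # partial products (inverted when the digit is negative)
    corrections = []  # deferred "+1" terms, summed together with the rows
    for j, d in enumerate(booth_radix4_digits(y, n_bits)):
        magnitude = abs(d) * x
        if d < 0:
            rows.append((~magnitude) << (2 * j))  # NOT logic only
            corrections.append(1 << (2 * j))      # +1 moved to the adder stage
        else:
            rows.append(magnitude << (2 * j))
    # Stand-in for the adder tree: it absorbs both rows and correction bits.
    return sum(rows) + sum(corrections)
```

Because `~m + 1 == -m` holds for Python integers, adding the correction terms in the same summation gives the same product as negating each row up front.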
... The multiplier performs the multiplication operation over the two input operands; the adder performs the addition of the result of the multiplier with the result of the previous cycle, and the register or accumulator stores the sum for the next cycle addition. Different approaches for multiplication and addition for MAC operations have been described in detail in the literature [11,12]. The essential activity of MAC is to generate the product of two operands X i and Y i and add the result with the previously stored result from the last multiplication, as shown in Eq. (1). ...
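The multiply-accumulate behaviour described in this excerpt can be sketched in a few lines. This is a behavioural model only: the function name `mac` and the use of unbounded Python integers are assumptions, and the equation the excerpt refers to is not reproduced here.

```python
def mac(xs, ys, acc=0):
    """Accumulate the products of paired operands, one pair per 'cycle'."""
    for x, y in zip(xs, ys):
        product = x * y      # multiplier stage
        acc = acc + product  # adder stage; acc plays the role of the register
    return acc
```

For example, `mac([1, 2, 3], [4, 5, 6])` accumulates 4, then 14, then 32.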
... In recent years, researchers have developed different MAC architectures [9][10][11][12][13][14][15][16][17]. For example, a high-speed MAC architecture with an optimized area was proposed in 2007 by Abdelgawad et al. [9]. ...
... This proposed architecture reduces the latency and area of Wallace tree multiplier with the help of the Booth algorithm and compressor adders. In 2014, Luu et al. proposed an unsigned 32-bit multiplier for best timing performance with the optimized area [12]. In 2012, Deepak et al. [14] proposed a novel architecture for the multiplier. ...
Article
Full-text available
The recently established Harris Hawks optimization (HHO) algorithm can search a global space for an optimum solution without getting trapped in a previous convergence. However, the exploitation phase of the current Harris Hawks optimizer is poor. In the present research, an improved version that combines HHO with Particle Swarm Optimization, named the ameliorated Harris Hawks optimizer, is proposed to solve various optimization problems such as nonlinear, non-convex, and highly constrained engineering design problems. The exploitation phase of the existing HHO algorithm is improved using a particle swarm optimization algorithm, and its performance is tested on the CEC2005, CEC2017, and CEC2018 benchmark problems. Also, discrete algorithms such as FFT, convolution, and image processing algorithms use the multiply and accumulate (MAC) unit as a critical component. The efficiency of a MAC depends mainly on its speed of operation, power dissipation, and chip area, along with the complexity of the circuit. In this research paper, a power-efficient signed floating-point MAC (SFMAC) is proposed using a universal compressor-based multiplier (UCM). Instead of a complex design architecture, a simple multiplexer-based circuit is used to achieve a signed floating-point output. The 8 × 8 SFMAC takes an 8-bit mantissa and a 3-bit exponent, so its input can range from −(7.96875)10 to +(7.96875)10. The proposed architecture is designed and implemented with the Cadence Spectre tool in GPDK 90 nm and TSMC 130 nm technologies. The analysis shows that the proposed SFMAC architecture consumes less power than recent MAC architectures available in the literature.
... Moreover, the Wallace tree multiplier is highly irregular and complicated. So, in order to overcome the irregular structure, several modified Wallace tree multipliers are proposed in the literature [4, 15-23]. All these multiplier architectures are based upon the Wallace tree algorithm. ...
Article
Full-text available
Digital system algorithms such as FFT, convolution, and image processing algorithms deploy the Multiply and Accumulate (MAC) unit as a critical component. The efficiency of a MAC typically relies on its speed of operation, power dissipation, and chip area, along with the complexity of the circuit. In this research paper, a power-delay-efficient signed floating-point MAC (SFMAC) is proposed using a Universal Compressor based Multiplier (UCM). Instead of a complex design architecture, a simple multiplexer-based circuit is used to achieve a signed floating-point output. The 8 × 8 SFMAC takes an 8-bit mantissa and a 3-bit exponent, so its input can range from −(7.96875)10 to +(7.96875)10. The proposed architecture is designed and implemented with the Cadence Spectre tool in GPDK 90 nm and TSMC 130 nm CMOS, and proves to be power and delay efficient.
... This multiplier uses the same traditional approaches without novelty. Xuan et al. [26] presented an unsigned 32-bit multiplier using a Wallace multiplier and a Booth encoder. A carry-lookahead adder and a modified Wallace tree adder are used as the adder tree, which gives fast results at the cost of more area utilization. ...
Article
Full-text available
In digital image processing, compression is used to improve visual perception and reduce storage cost. Reconstructing medical images in hardware, especially the region of interest (ROI), using lossy image compression is a challenging task. In this paper, ROI-based discrete wavelet transformation (DWT) is designed using separate Wallace-tree multiplier (WM) and modified Vedic multiplier (VM) methods. The lifting-based DWT method is used for ROI compression and reconstruction. The 9/7 filter coefficients are multiplied in the DWT using the WM and the modified VM. The designed Wallace tree multiplier works in parallel using a pipelined architecture, resulting in optimized hardware resources, and the 8x8 Vedic multiplier design improves ROI reconstruction image quality and computation speed. To compare ROI-based DWT-WM and DWT-VM on an FPGA platform, the PSNR and MSE are calculated for different brain MRI images, and hardware results including area, delay, maximum operating frequency, and power are tabulated. The proposed model is designed in Verilog HDL on the Xilinx platform, simulated using ModelSim, and implemented on an Artix-7 FPGA device.
... Hence a combination of both can provide a better result. Various multiplier circuits are explained in the literature, mainly focusing on power consumption, multiplier delay, and smaller area [1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18]. However, studies have found that area and speed of operation are the two most conflicting design constraints. ...
... The conventional Wallace tree multiplier algorithm is divided into three stages. Stage 1: partial product generation. Stage 2: addition of partial products, which creates 'sum' and 'carry' separately. Stage 3: a final adder, generally a fast adder, which adds the sum and carry together to yield the final result [9]. ...
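The three stages listed in this excerpt can be sketched at word level. This is a simplified model under stated assumptions: each row is a whole Python integer, the reduction uses word-wide 3:2 carry-save steps rather than a per-column Wallace schedule, and the final fast adder is stood in for by ordinary addition. The function names are hypothetical.

```python
def partial_products(a, b, n_bits):
    """Stage 1: one shifted copy of a for every set bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n_bits)]


def carry_save_add(x, y, z):
    """3:2 step: three rows in, separate 'sum' and 'carry' rows out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a, b, n_bits=8):
    rows = partial_products(a, b, n_bits)
    # Stage 2: keep compressing groups of three rows until two remain.
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        reduced.extend(rows[full:])  # leftover rows pass through unchanged
        rows = reduced
    # Stage 3: final carry-propagating addition of the sum and carry rows.
    return sum(rows)
```

Each carry-save step preserves the total (`x + y + z == s + c`), so the value of the row set never changes; only the number of rows shrinks until a single carry-propagating addition is left.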
... Secondly, the Wallace tree multiplier is highly irregular and complicated. So, in order to overcome the irregular structure, several modified Wallace tree algorithms are proposed in the literature [2,3,5,6,7,8,9,12,13,14]. All these multiplier algorithms are based upon the Wallace tree. ...
Article
Full-text available
In the era of digital signal processing, such as graphics and computation systems, multiplication is one of the prime operations. A multiplier is a key component in any kind of digital system such as Multiply-Accumulate (MAC) unit, various FFT algorithms, etc. The efficiency of a multiplier is mainly dependent upon the speed of operation and power dissipation of the circuit along with the complexity level of the multiplier. This paper is based on Universal Compressor based Multiplier (UCM), which yields a high-speed operation with comparative power dissipation; hence, the enhanced performance is reported. The novel design of UCM is analyzed using Cadence Spectre tool in 90nm CMOS technology. Finally, the UCM is implemented using Nexys-4 Artix-7 FPGA board. The novel design of UCM has demonstrated significant improvement in terms of delay, which is explored in this paper.
... Then in each iteration step, the k i -Mul module keeps multiplying the previous value in the register with the new k i value from the k i -LUT. The multiplier used in the design is based on the high-speed unsigned multiplier design in [18]. It is noted that the k i -Mul module has to deal only with the 24-bit of mantissa because the k i values always stay in the range of 0.707 to 1.0 according to Table I, thus leading to a fixed number of length-factor K's sign and exponent parts. ...
... The goal of this module is to multiply the received x i -y i values with the length-factor K, then normalize the results to the IEEE-754 floating-point format. The multipliers used in the design are also based on the high-speed unsigned multiplier design in [18]. After the multiplications, the Lead-One-Detector (LOD) modules start the normalization process by finding the positions of the first '1'-bit in the two mantissa values. ...
... The register in Fig. 8 is reset to the value of 1.0 at the beginning of a computation process. The multiplier used in the implementation is based on the high-speed multiplier design in [26]. The design of the k i -Mul module implements only the 24-bit mantissa part. ...
... The module multiplies the x i -y i values from the XY-Add module with the length-factor K from the k i -Mul module, then normalizes the multiplication results to the IEEE-754 floating-point format. The two multipliers used in the figure are based on the high-speed unsigned multiplier design in [26]. After the multiplications, the normalization process begins with the Lead-One-Detector (LOD) modules to find the positions of the first '1'-bit in the two mantissa values, and then the two mantissa values are shifted to the left by the Left Shifter modules corresponding to the LOD's results. ...
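The normalization step in this excerpt (find the first '1' bit, then shift left by the corresponding amount) can be sketched as follows. The function names and the choice to return the shift count are assumptions; how the cited design adjusts the exponent afterwards is not stated in the excerpt and is left out.

```python
def lead_one_position(value, width):
    """Index (0 = LSB) of the most significant '1' bit, or -1 if value is 0."""
    assert 0 <= value < (1 << width)
    return value.bit_length() - 1


def normalize(value, width):
    """Shift left until the leading '1' sits in the top bit of the word."""
    pos = lead_one_position(value, width)
    if pos < 0:
        return 0, 0  # nothing to normalize for an all-zero word
    shift = (width - 1) - pos
    return value << shift, shift
```

For an 8-bit word `0b00010110`, the leading one is at index 4, so the word is shifted left by 3 to give `0b10110000`.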
... The new k i value is then multiplied by the current K value to produce the new K factor. This K factor is stored in a register and driven to the output through the oK signal, as can be seen in Fig. 6. At the start, the register holding K is initialized to 1. Because the k i values are all positive, the high-speed unsigned multiplier design inherited from the previous work [25] is used. K takes values from 0.60725 to 1. It reaches its minimum value of 0.60725 when all 16 values are multiplied together. ...
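The running product K in this excerpt can be checked numerically with a short sketch. The per-iteration factor used here, k_i = 1/sqrt(1 + 2^(-2i)), is the textbook CORDIC scale factor and is an assumption: the table of k i values from the cited work is not part of this page. Floating-point numbers stand in for the 24-bit mantissa arithmetic, and the function names are hypothetical.

```python
import math


def k_i(i):
    # Assumed textbook CORDIC factor for iteration i (not taken from the source).
    return 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))


def length_factor(selected_iterations):
    """Multiply the k_i of every selected iteration into a register reset to 1."""
    K = 1.0  # register reset value at the start of a computation
    for i in selected_iterations:
        K *= k_i(i)  # unsigned multiply: every k_i is positive
    return K
```

Under this assumption, `length_factor(range(16))` comes out at roughly 0.60725, and an empty selection leaves K at 1, which matches the range quoted in the excerpt. The largest single reduction comes from `k_i(0)`, about 0.707.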
... Because multiplication does not require aligning the exponents, the mantissa of K is multiplied directly with the mantissa parts of the two input values X and Y. The multiplier implementation is based on the high-speed unsigned multiplier inherited from the group's previous work [25]. The multiplier result is passed to the Lead-One-Detector (LOD) module to find the position of the first '1' bit in the bit string. ...
Article
Full-text available
In this paper, a single-precision floating-point FFT twiddle factor (TF) implementation is proposed. The architecture is based on the Adaptive Angle Recoding CORDIC (AARC) algorithm. The TF design was built and verified on an Altera Stratix IV FPGA chip and in 65 nm SOTB synthesis. The FPGA implementation had a maximum frequency of 103.9 MHz, a throughput of 16.966 Mega-Sample per second (MSps), and a resource utilization of 7,747 ALUTs and 625 registers. The SOTB synthesis, on the other hand, had 16,858 standard cells on an area of 298 × 291 μm², a maximum frequency of 166 MHz, and a speed of 27.107 MSps. The accuracy results were 1.133E-10 Mean-Square-Error (MSE) and about 26 part-per-million (ppm) maximum error.
... At the beginning of a TF computation transaction, the register is reset to the value of one. The multiplier is designed based on the previous work of the authors [25]. ...
... Based on the fact that the floating-point multiplication does not need to balance the exponent parts before the process, the value of K is multiplied directly with the mantissa parts of X and Y input values. The multipliers in the implementation are based on the unsigned high-speed multiplier from our previous work at [25]. The results of the multiplications are transferred to two Lead-One-Detector (LOD) modules to find the position of the first '1'-bit. ...
Conference Paper
Full-text available
In this paper, a single-precision floating-point FFT Twiddle Factor (TF) implementation is proposed. The architecture is based on Adaptive Angle Recoding CORDIC (AARC) algorithm. The TF design is built and verified on Altera Stratix IV FPGA chip and 65nm SOTB synthesis. The FPGA implementation has 103.9 MHz maximum frequency, throughput result of 16.966 Mega-Sample-per-second (MSps), and resources utilization of 7,747 ALUTs and 625 registers. On the other hand, the SOTB synthesis has 16,858 standard cells on an area of 86,718 μm², 166 MHz maximum frequency, and the speed of 27.107 MSps. The accuracy results are 1.133E-10 Mean-Square-Error (MSE) and about 26 part-per-million (ppm) maximum error-ratio.
... It can be seen that the proposed Wallace is much faster as compared to the other high speed Wallace multipliers. There is only one design [22] that is faster than the proposed CBW, however it requires 58 mm2 chip area which is more than 700 times larger as compared to the proposed CBW multiplier. The results in Table III compare only 32-bit multipliers which are not sufficient to make any conclusions. ...
Conference Paper
Full-text available
Wallace tree multipliers provide a power-efficient strategy for high speed multiplication. The use of high speed 7∶3 counters in the Wallace tree reduction can further improve the multiplier speed. This paper presents an algorithmic approach to construct the counter based Wallace tree multipliers. The proposed algorithm can be used to implement the efficient counter based Wallace multiplier of any size suitable for FPGA or ASIC synthesis tools. The designs are synthesized in Synopsys Design Compiler using 90 nm CMOS technology. The detailed comparison of traditional and counter based Wallace multipliers is performed which shows that the counter based Wallace multiplier is up to 22% faster as compared to the traditional Wallace multiplier.
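A 7:3 counter, as mentioned in this abstract, takes seven bits of equal weight and outputs the 3-bit count of how many are set. The sketch below is a behavioural model only, with a hypothetical function name; it does not reflect the gate-level structure or the construction algorithm of the cited paper.

```python
def counter_7_3(bits):
    """Compress seven equal-weight bits into a 3-bit count (weights 4, 2, 1)."""
    assert len(bits) == 7 and all(b in (0, 1) for b in bits)
    total = sum(bits)  # number of ones, between 0 and 7
    return (total >> 2) & 1, (total >> 1) & 1, total & 1
```

In a reduction tree, the three outputs would feed the current column and the next two higher columns, so one counter replaces seven rows in a column with three bits spread over three columns.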