Fig 5 - uploaded by Hamid Arabnia
Proposed 4-bit Carry Look-Ahead Adder 

Source publication
Article
Full-text available
Reversible logic is emerging as a promising computing paradigm, having its applications in low-power CMOS, quantum computing, nanotechnology and optical computing. Firstly, we showed a modified design of conventional BCD subtractors and also proposed designs of carry look-ahead and carry skip BCD subtractors. The proposed designs of carry look-ahe...

Contexts in source publication

Context 1
... MCLA [11] uses the modified full adder (MFA) as shown in Fig. 4. In our proposed carry look-ahead adder shown in Fig. 5, we propose replacing the 4th MFA in the MCLA by a full adder to reduce the area (number of gates) without sacrificing speed improvement. It can easily be verified that there will be a reduction in the number of gates to generate the final carry, as shown in Fig. 5. ...
Context 2
... MCLA [11] uses the modified full adder (MFA) as shown in Fig. 4. In our proposed carry look-ahead adder shown in Fig. 5, we propose replacing the 4th MFA in the MCLA by a full adder to reduce the area (number of gates) without sacrificing speed improvement. It can easily be verified that there will be a reduction in the number of gates to generate the final carry, as shown in Fig. 5. In order to have further savings in terms of the number of gates, the proposed 4-bit CLA can be cascaded in series to design an expanded-width CLA, as shown in Fig. 6. ...
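For orientation, the carry look-ahead idea and the cascading of 4-bit blocks can be sketched in a few lines of Python. This is only a minimal model of the conventional (irreversible) CLA equations, with generate g_i = a_i AND b_i and propagate p_i = a_i XOR b_i. It does not reproduce the MFA-based MCLA of Fig. 4 or the proposed design of Fig. 5, and the function names `cla4` and `cascaded_cla` are hypothetical.

```python
def cla4(a, b, c0=0):
    """Conventional 4-bit carry look-ahead addition; returns (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate signals
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate signals

    # Every carry is expanded in terms of g, p and c0 only,
    # so no carry has to wait for the previous one to ripple through.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4


def cascaded_cla(a, b, blocks=2, c0=0):
    """Wider adder built by chaining 4-bit CLA blocks in series (assumed wiring)."""
    total, carry = 0, c0
    for k in range(blocks):
        nibble_sum, carry = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= nibble_sum << (4 * k)
    return total, carry
```

In this model only the final carry of each block is handed to the next block, which is the signal the quoted passage targets when it replaces the 4th MFA by a full adder.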
Context 3
... of the prominent functionalities of the TSG gate is that it can work singly as a reversible full adder unit. Figure 15 shows the implementation of the TSG gate as a reversible full adder. TSG can implement the reversible full adder with a bare minimum of two garbage outputs (at least two garbage outputs will be required to realize a reversible full adder). ...
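The full-adder behaviour described here can be checked with a small truth-table model. The sketch below assumes the commonly cited definition of the 4x4 TSG gate, P = A, Q = A'C' XOR B', R = Q XOR D, S = QD XOR (AB XOR C). That definition is not given on this page, so treat it as an assumption; the helper names are hypothetical.

```python
from itertools import product


def tsg(a, b, c, d):
    """Assumed TSG gate mapping (A, B, C, D) -> (P, Q, R, S) on single bits."""
    p = a
    q = ((1 - a) & (1 - c)) ^ (1 - b)  # A'C' XOR B'
    r = q ^ d
    s = (q & d) ^ ((a & b) ^ c)
    return p, q, r, s


def tsg_full_adder(a, b, cin):
    """Full adder from one TSG gate: C is tied to the constant 0, D carries Cin."""
    _garbage1, _garbage2, total, carry = tsg(a, b, 0, cin)
    return total, carry


def is_reversible(gate, n_inputs):
    """A gate is reversible when distinct inputs always give distinct outputs."""
    outputs = {gate(*bits) for bits in product((0, 1), repeat=n_inputs)}
    return len(outputs) == 2 ** n_inputs
```

With C fixed to 0, Q reduces to A XOR B, so R is the sum bit and S is the carry bit, while P and Q are the two garbage outputs mentioned in the passage.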
Context 4
... designing the individual reversible components of the carry look-ahead BCD subtractor, the components are combined together to design the complete reversible carry look-ahead BCD subtractor, as shown in Fig. 25. It is to be noted that we have used the same strategy of connecting the Feynman gates as chains for generating the XOR, copying and NOT functions, with zero garbage (please refer to Fig. 8, in which four XOR gates and one NOT gate are required in the middle of the CLA BCD subtractor). ...
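The three roles of the Feynman gate named above follow directly from its standard definition, (A, B) -> (A, A XOR B). The sketch below is a bit-level illustration only; the chaining helper `fan_out` is a hypothetical reading of "connecting the Feynman gates as chains" and is not taken from Fig. 8 or Fig. 25.

```python
def feynman(a, b):
    """Feynman (controlled-NOT) gate: (A, B) -> (A, A XOR B)."""
    return a, a ^ b


def xor_bits(a, b):
    """XOR: the second output, with both inputs free."""
    return feynman(a, b)[1]


def not_bit(a):
    """NOT: the second output when the target input is the constant 1."""
    return feynman(a, 1)[1]


def fan_out(a, n):
    """Produce n copies of a by chaining gates whose target input is the constant 0.

    The pass-through output of each gate feeds the next gate, so every output
    of the chain is a useful copy and nothing is left over as garbage.
    """
    line, copies = a, []
    for _ in range(n - 1):
        line, duplicate = feynman(line, 0)
        copies.append(duplicate)
    return [line] + copies
```

Because the first output always repeats the control input, a signal that is still needed downstream can be reused instead of being discarded, which is how such chains can avoid garbage outputs.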

Similar publications

Preprint
This paper details the approach to the design and optimization of reversible adders and multipliers utilizing the Peres gate, which is a three-input, three-output gate. Peres gates are recognized for their universality and energy-efficient properties and present an intriguing option for constructing reversible circuits. Reversible logic characterized by its ab...
Article
Reversible logic has promising applications in emerging computing paradigms such as quantum computing, quantum-dot cellular automata, optical computing, etc. In reversible logic gates there is a unique one-to-one mapping between the inputs and outputs. To generate a useful gate function the reversible gates require some constant ancillary i...
Article
The scaling problems that conventional technologies were facing were successfully addressed by the novel technology known as Quantum-dot Cellular Automata (QCA). The principle of spin and anti-spin has already been employed for the representation of binary logic states. QCA has exploited the concept of spin and evolved a new co...
Article
In recent years, reversible logic has emerged as a promising computing paradigm with applications in low-power CMOS, quantum computing and error detection. Reversible computing dissipates zero energy in terms of information loss, and it can also detect circuit errors by keeping a unique input-output mapping. In this paper, we have proposed a regula...
Chapter
Reversible logic is an emerging computing paradigm with applications in ultra-low-power nanocomputing, quantum computing, low-power CMOS design, optical information processing, bioinformatics, etc. In this paper, a 4x4 reversible multiplier circuit is proposed, designed with a new reversible gate called RAM. The proposed multiplier circuit is effici...

Citations

... To date, using the reversible gates, different types of reversible circuits have been introduced. In [35] novel reversible gates have been used to design carry-lookahead and carry-skip BCD subtractors. ...
Article
Quantum-dot cellular automata (QCA) as an emerging nanotechnology are envisioned to overcome the scaling and the heat dissipation issues of the current CMOS technology. In a QCA structure, information destruction plays an essential role in the overall heat dissipation, and in turn in the power consumption of the system. Therefore, reversible logic, which significantly controls the information flow of the system, is deemed suitable to achieve ultra-low-power structures. In order to benefit from the opportunities QCA and reversible logic provide, in this paper, we first review and implement prior reversible full-adder art in QCA. We then propose a novel reversible design based on three- and five-input majority gates, and a robust one-layer crossover scheme. The new full-adder significantly advances previous designs in terms of the optimization metrics, namely cell count, area, and delay. The proposed efficient full-adder is then used to design reversible ripple-carry adders (RCAs) with different sizes (i.e., 4, 8, and 16 bits). It is demonstrated that the new RCAs lead to 33% less garbage outputs, which can be essential in terms of lowering power consumption. This along with the achieved improvements in area, complexity, and delay introduces an ultra-efficient reversible QCA adder that can be beneficial in developing future computer arithmetic circuits and architectures.
... Performance of sequential computation is reaching its limit, so more and more research is focusing on parallel computations. Besides quantum computers [25,37,38,39], massively parallel systems are considered very promising, and many topologies for their interconnection networks have been proposed [1,2,7,16,32,36,40] and analyzed [3,6,8,9,10,11,12,42]. The folded hypercube, which is a variant of hypercube, is one such topology. ...
Article
A folded hypercube is obtained by adding complementary links to a hypercube. The diameter of the folded hypercube is almost half of that of the hypercube, while its degree is larger than the degree of the hypercube by only one. In this paper, we propose a stochastic link-fault-tolerant routing algorithm in a folded hypercube by introducing a limited global information called routing probabilities. For an n-dimensional folded hypercube, we have proved that the routing probabilities for all distances can be calculated in \(O(n^2\log n)\) time at each node and the message can be forwarded to its neighbor node at each node in O(n) time. We also conducted a computer experiment, the results of which show that our algorithm achieves a better performance than the best algorithm for a hypercube.
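The diameter and degree claims in this abstract can be checked numerically for small n. The sketch below is only an illustration of the topology, not the routing algorithm of the cited paper: it builds the neighbour lists of an n-dimensional hypercube, optionally adds the complementary link, and measures the diameter by breadth-first search. The function names are hypothetical.

```python
from collections import deque


def neighbors(v, n, folded=True):
    """Neighbours of node v: flip one of the n bits, plus the complement if folded."""
    result = [v ^ (1 << i) for i in range(n)]
    if folded:
        result.append(v ^ ((1 << n) - 1))  # complementary link
    return result


def diameter(n, folded=True):
    """Diameter via BFS from node 0 (enough here because both graphs are vertex-transitive)."""
    distance = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbors(v, n, folded):
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return max(distance.values())
```

For example, `diameter(6, folded=False)` evaluates to 6 while `diameter(6)` evaluates to 3, at the cost of one extra link per node.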
... HPC technologies, parallel processing and parallel applications utilizing various types of strategies, such as more advanced processors, multiple processors, multiple servers, data science and analytics, and heavy scientific computing, have long been an interesting research issue [39][40][41][42][43][44][45][46]. These HPC technologies are often used for climate models or earth system models. ...
Article
High-performance computing for climate models has always been an interesting research area. It is valuable to nest a regional climate model within a global climate model, but large-scale simulation of the nesting or coupling severely challenges the development of efficient parallel algorithms that fit well into multi-core clusters. This paper first presents research on the coupling of the Institute of Atmospheric Physics of Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 and the Weather Research and Forecasting model, then proposes an efficient parallel algorithm of the coupling. The algorithm includes initialization of input data, decomposition of computing grid and processes, parallel computing of component models, and data exchange by a coupler. By calling some subroutines of the Model Coupling Toolkit, the parallelization of the proposed algorithm is implemented. Experiments show that the parallel algorithm is very effective and scalable. The parallel efficiency of the algorithm on 1,024 CPU cores can reach up to 70%. Moreover, its parallel efficiency with respect to weak scalability is 72.56% on a multi-core cluster.
... However, the design and implementation of algorithms that can exploit distributed memory environments for solving large-scale flow problems can be found in [1,2]. The design of hardware such that it works in concert with software for obtaining maximum performance on distributed memory environments can be found in [57,58]. ...
Article
MFiX, a general-purpose Fortran-based suite, simulates the complex flow in fluidized bed applications via BiCGStab and GMRES methods along with plane relaxation preconditioners. Trilinos, an object-oriented framework, contains various first- and second-generation Krylov subspace solvers and preconditioners. We developed a framework to integrate MFiX with Trilinos, as MFiX does not possess advanced linear methods. The framework allows MFiX to access advanced linear solvers and preconditioners in Trilinos. The integrated solver is called MFiX–Trilinos hereafter. In the present work, we study the performance of variants of GMRES and CGS methods in MFiX–Trilinos and BiCGStab and GMRES solvers in MFiX for a 3D gas–solid fluidized bed problem. Two right preconditioners employed along with various solvers in MFiX–Trilinos are Jacobi and smoothed aggregation. The flow from MFiX–Trilinos is validated against the same from MFiX for BiCGStab and GMRES methods. The effect of the preconditioning on the iterative solvers in MFiX–Trilinos is also analyzed. In addition, the effect of left and right smoothed aggregation preconditioning on the solvers is studied. The performance of the first- and second-generation solver stacks in MFiX–Trilinos is studied as well for two different problem sizes.
... Furthermore, to make hardware and software work in concert on such heterogeneous systems with hybrid LLC, we propose comprehensive management policies to reduce the contention between CPU and GPU units. Similar hardware and software co-design examples and applications can be found in the domain of FPGA, SIMD/GPU, and other processor designs [18][19][20][21][22][23][24]. ...
Article
Shared last-level cache (LLC) in on-chip CPU–GPU heterogeneous architectures is critical to the overall system performance, since CPU and GPU applications usually show completely different characteristics on cache accesses. Therefore, when co-running with CPU applications, GPU ones can easily occupy the majority of the LLC, making CPU applications starve severely. This imposes significant challenges to the design and management of the shared LLC in CPU–GPU heterogeneous architectures. To improve the overall system performance, we consider integrating conventional SRAM and a new memory technology (i.e., STT-RAM) to enlarge the shared LLC. Furthermore, we propose comprehensive management policies to reduce the contention between CPU and GPU units. Experimental results show that, compared with the conventional SRAM-only LLC design, our proposal improves the performance of CPU workloads by 17% while not hurting GPU ones and reduces the LLC energy consumption by 30% on average.
... With the development and application of hardware technology with high reliability and precision [34], the proposal of the reversible programming logic array has provided a paradigm for the arena of reconfigurable computing [32,33]. Reversible logic, as a promising computing paradigm, has been applied in quantum computing, nanotechnology, optical computing, and so on [35,36]. Algorithms for fast operations have been designed to exploit SIMD parallel architecture [2] and provide low-power operations [20]. ...
Article
Viewshed analysis has been widely used in various spatial analysis applications. But the expense of viewshed computation remains high in both time and space complexity for large-scale terrain data, so parallel computing techniques have been introduced to improve its performance. However, a failure in such a parallel computing system with many computing nodes or processors may lead to an increase in the execution time and cost of running the viewshed computation. Highly fault-tolerant parallel computing will greatly enhance the reliability of the algorithm without losing its performance. In this article, we present a fault-tolerant computing framework for parallel viewshed computation in a parallel computing system using a redundancy computing strategy. Two schedule strategies, layer and axis direction schedule, are adopted as the primary process and the slave process, respectively, to check whether errors occur during the computation. A rollback and re-computation process is presented to correct these errors when an error is found by comparing the results of the primary process and its slave process. The fault-tolerant algorithm in this article is implemented using process-level and thread-level parallelization. Our method can make full use of the multiple processors provided by the parallel computing environment without losing the computation efficiency of the algorithm. To illustrate the usefulness of our approach, several experiments are executed using the Xdraw viewshed algorithm. The results demonstrate that our approach achieves a speedup ratio of 14.91 with 16 processes and an average precision rate of 99.4% in comparison with the simple checkpoint Xdraw algorithm.
... This operation is usually performed by using full adders and half adders. Although there are many designs for non-parity-preserving adders such as [15,27,28], only the parity-preserving adders can be helpful for the parity-preserving multipliers. There exist some parity-preserving gates that can perform the operation of a parity-preserving full adder (such as F2PG [8], LCG [17] and ZPLG [26]) or half adder (MIG [24] and ZCG [26]) after setting some of their inputs to zero as the constant inputs. ...
Article
Reversible logic as a new promising design domain can be used for DNA computations, nanocomputing, and especially constructing quantum computers. However, the vulnerability to different external effects may lead to deviation from producing correct results. The multiplication is one of the most important operations because of its huge usage in different computing systems. Thus, in this paper, some novel reversible logic array multipliers are proposed with error detection capability through the usage of parity-preserving gates. By utilizing the new arrangements of existing reversible gates, some new circuits are presented for partial product generation and multi-operand addition required in array multipliers which results in two unsigned and three signed parity-preserving array multipliers. The experimental results show that the best of signed and unsigned proposed multipliers have the lowest values among the existing designs regarding the main reversible logic criteria including quantum cost, gate count, constant inputs, and garbage outputs. For \(4\times 4\) multipliers, the proposed designs achieve up to 28 and 46% reduction in the quantum cost and gate count, respectively, compared to the existing designs. Moreover, the proposed unsigned multipliers can reach up to 58% gate count reduction in \(16\times 16\) multipliers.
... In other words, a system is called distributed if the message transmission delay is not negligible compared to the time between the events in a single process [1]. One of the most important aims in distributed systems is to provide an environment conducive to sharing resources [2][3][4][5][6][7][8][9][10][11]. Hence, it is possible that several processes simultaneously request a shared resource. ...
Article
Solving the problem of mutually exclusive access to a critical resource is a major challenge in distributed systems. In some solutions, there is a unique token in the whole system which acts as a privilege to access a critical resource. Practical and easily implemented, the token-ring algorithm is one of the most popular token-based mutual exclusion algorithms known in this field’s literature. However, it suffers from low scalability and a high average waiting time for resource seekers. The present paper proposes a new algorithm which employs a two-dimensional torus logical structure of N processes and the token-ring algorithm concept. It performs in a way that increasingly raises scalability and reduces the average waiting time of the token-ring algorithm. The token makes a circular movement along the columns of the two-dimensional torus (vertical ring), while the requests for the critical resource make a circular movement along the rows of the torus (horizontal ring). In this algorithm, the number of messages exchanged is between 2√N + 1 and 3√N + 1 under light load situations and, under heavy load situations, is at the most three messages per critical section invocation. Thus, in contrast with the leading algorithms, the proposed algorithm has gained significant improvements, in addition to having been proved to operate correctly.
... Field programmable gate arrays (FPGA) were originally used for purely logic-based computations. However, they have been adapted over the past dozen years to integer arithmetic and even floating-point arithmetic calculations [29][30][31][32][33][34][35][36][37]. ...
Article
This paper presents the parallelization on a GPU of the sequential matrix diagonalization (SMD) algorithm, a method for diagonalizing polynomial covariance matrices, which is the most recent technique for polynomial eigenvalue decomposition. We first parallelize with CUDA the calculation of the polynomial covariance matrix. Then, following a formal transformation of the polynomial matrix multiplication code—extensively used by SMD—we insert in this code the cublasDgemm function of CUBLAS library. Furthermore, a specialized cache memory system is implemented within the GPU to greatly limit the PC-to-GPU transfers of slices of polynomial matrices. The resulting SMD code can be applied efficiently over high-dimensional data. The proposed method is verified using sequences of images of airplanes with varying spatial orientation. The performance of the parallel codes for polynomial covariance matrix generation and SMD is evaluated and reveals speedups of up to 161 and 67, respectively, relative to sequential execution on a PC.
... Some works [14][15][16][17][18][19][20][21][22][23][24] presented how hardware and software can work in concert on scalable multiprocessor systems with a number of illustrative matrix-based examples and applications, and provided a historical perspective and relevant context to numerical computation on CPU-GPU heterogeneous computing systems. Some works [25][26][27][28][29][30][31][32][33] discussed how processor technologies can help scientific computing applications which involve matrix operations. Two-level parallelization [34] was introduced to solve a massive block-tridiagonal matrix system. ...
Article
Solving block-tridiagonal systems is one of the key issues in numerical simulations of many scientific and engineering problems. For most block-tridiagonal matrices, non-zero elements are mainly concentrated in the blocks on the main diagonal, and the blocks above and below the main diagonal have few non-zero elements. Therefore, we present a solving method which mixes direct and iterative methods. In our method, the submatrices on the main diagonal are solved by the direct methods in the iteration processes. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the block-tridiagonal system of linear equations can be improved. Some direct methods have good performance in solving small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm to solve the sub-equations by thread blocks on GPU, and the intermediate data are stored in shared memory, so as to significantly reduce the latency of memory access. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices which are produced by the simulation of the cloud-computing system. The computing performance of solving these block-tridiagonal systems of linear equations can be improved using our method.