Architecture of the 512-bit RSA processor.

Source publication

A new RSA cryptosystem hardware design based on Montgomery's algorithm

Article

Full-text available

Aug 1998

In this paper, we propose a new algorithm based on Montgomery's algorithm to calculate modular multiplication, which is the core arithmetic operation in an RSA cryptosystem. The modified algorithm eliminates the overly large residue and has a very short critical path delay, which yields very high-speed processing. The new architecture based on this modified...

Context 1

... the last iteration of the modular multiplication operation, we post-process R by taking the result and 1 as input operands to remove the extra factor, i.e., [...]. We can observe that if the input operand is 1, the higher n bits of the product will be zero, so the output of postprocessing will be less than the modulus N. Therefore, we not only remove the unwanted factor from the result but also make the result fall in the correct range after postprocessing.

... multiplications is $\lceil \log_2 E \rceil + v(E) - 2$, where $v(E)$ is the number of nonzero bits in E. So, for n-bit RSA modular exponentiation with equal probability for 0 and 1, the number of modular multiplications is $(2n + 2)$ in the worst case and $(1.5n + 2)$ in the average case. Algorithm 2 takes $(2n + 2) \times n$ clock cycles, which is fewer than in [10] and [11], which need $(2n + 2) \times 2n$ cycles to complete a modular exponentiation in the worst case. Since the cycle time is equal to that in [10] and [11], our algorithm takes less time to complete RSA operations and has higher throughput.

Fig. 1 shows the architecture of a 512-bit RSA processor based on the modified algorithm. We use four 512-bit linear shift registers to store the operands needed in computing a 512-bit RSA operation ($M^E \bmod N$). The operations of the RSA processor are described below. In the initial stage, RSA operands are loaded into the shift registers serially through an 8-bit input buffer. While loading message M into the text register, we shift the exponent register until the first nonzero bit is in the most significant position, and count the number of bits of the exponent, $\lceil \log_2 E \rceil$. After the initial stage, we start the multiplier. Once the first output bit of the multiplier is ready, we start the Montgomery module immediately. So the execution times of the CPA, multiplier, and Montgomery module almost completely overlap. Therefore, the function units of our design are fully utilized during computation.
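The flow described in this excerpt (square-and-multiply exponentiation, followed by a final Montgomery multiplication by 1 that strips the extra factor) can be illustrated with a small software sketch. This is not the paper's Algorithm 2, which the excerpt does not reproduce. It is a generic Montgomery multiplication without final subtraction, assuming a radix of R = 2^(n+2) so that intermediate values stay below 2N. The function names `montgomery_setup`, `mont_mul`, and `mont_exp` are hypothetical, and the multiplication count returned is for this sketch only, so it may differ by a small constant from the figures quoted above.

```python
def montgomery_setup(N, n):
    """Precompute constants for an odd modulus N < 2**n."""
    R = 1 << (n + 2)                   # assumption: R = 2^(n+2) > 4N
    N_neg_inv = (-pow(N, -1, R)) % R   # -N^{-1} mod R (needs Python 3.8+)
    R2 = (R * R) % N                   # maps an operand into the Montgomery domain
    return R, N_neg_inv, R2


def mont_mul(A, B, N, R, N_neg_inv):
    """Return a value congruent to A*B*R^{-1} mod N.

    For A, B < 2N and R > 4N the result is below 2N,
    so no final subtraction is performed.
    """
    T = A * B
    q = (T * N_neg_inv) % R            # chosen so that T + q*N is divisible by R
    return (T + q * N) // R


def mont_exp(M, E, N, n):
    """Left-to-right square-and-multiply; returns (M^E mod N, multiplications used)."""
    R, N_neg_inv, R2 = montgomery_setup(N, n)
    count = 0

    M_bar = mont_mul(M, R2, N, R, N_neg_inv)   # pre-processing: M*R mod N (below 2N)
    count += 1

    A = M_bar
    for bit in bin(E)[3:]:                     # exponent bits below the leading 1
        A = mont_mul(A, A, N, R, N_neg_inv)    # square
        count += 1
        if bit == "1":
            A = mont_mul(A, M_bar, N, R, N_neg_inv)  # multiply
            count += 1

    result = mont_mul(A, 1, N, R, N_neg_inv)   # post-processing: multiply by 1
    count += 1
    return result, count


if __name__ == "__main__":
    # Toy 12-bit modulus (61 * 53); real RSA moduli are far larger.
    print(mont_exp(65, 17, 3233, 12))
```

In this sketch the post-processing step plays both roles the excerpt attributes to it: it removes the factor R and brings the value back into the range of the modulus for nonzero messages coprime to N.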
1) Carry-Propagation Adder and Serial-Parallel Multiplier: The carry-propagation adder converts the carry-save output of the Montgomery module to nonredundant binary form. It generates one output bit per cycle to the serial-parallel multiplier for the next iteration. The serial-parallel multiplier shown in Fig. 2 realizes the multiplication and squaring of two (n + 1)-bit numbers. It first generates the n + 2 lower bits of a product serially to the Montgomery module, then it stops and holds the n higher bits of the product. The n higher bits of the product will be added to the output of the Montgomery module to get the modular multiplication ...
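The conversion from carry-save form to nonredundant binary, one bit per cycle, can be modeled in software as a bit-serial addition of the sum and carry vectors. The following is a behavioral sketch only: the excerpt does not give the adder's circuit, and the names `serial_cpa` and `bits_to_int` are hypothetical.

```python
def serial_cpa(sum_vec, carry_vec, width):
    """Yield the bits of (sum_vec + carry_vec) mod 2**width, LSB first.

    Each loop iteration models one clock cycle: one bit of each
    carry-save vector is consumed and one result bit is produced.
    """
    carry = 0
    for i in range(width):
        total = ((sum_vec >> i) & 1) + ((carry_vec >> i) & 1) + carry
        yield total & 1          # output bit for this cycle
        carry = total >> 1       # carry kept for the next cycle


def bits_to_int(bits):
    """Reassemble an LSB-first bit stream into an integer."""
    value = 0
    for i, b in enumerate(bits):
        value |= b << i
    return value


if __name__ == "__main__":
    s, c = 0b1011, 0b0110        # a redundant (sum, carry) pair
    print(bits_to_int(serial_cpa(s, c, 6)) == s + c)
```

Because the result is produced least significant bit first, a downstream serial-parallel multiplier can begin consuming bits before the full sum is known, which is consistent with the overlapped execution the excerpt describes.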


Context 2

... n is the filter length (window size) and T is the max or min operator, respectively. Equation (1) is called "running" max or min filtering because after each output calculation, the filter window is shifted one position to the right (i.e., it "runs"). The computational complexity, measured in number of comparisons per output point, is $C(n) = n - 1$. It is desirable to construct filter structures that have a smaller number of comparisons per output point in order to speed up the filtering process. This is accomplished by employing the "divide-and-conquer" strategy. Let us suppose that the filter window size n is a power of two: $n = 2^k$. It is easily seen that the max or min calculation of n numbers can be split into the max or min calculation of two subsequences of length $n/2$ each:

$$y_i = T(x_i, \ldots, x_{i-n+1}) = T\bigl[\,T(x_i, \ldots, x_{i-(n/2)+1}),\; T(x_{i-(n/2)}, \ldots, x_{i-n+1})\,\bigr] \tag{2}$$

This procedure can be repeated recursively until we reach subsequences of length 2 [4]. In this case, the max or min calculation of two numbers is done by one comparison only. The corresponding flow diagram is shown in Fig. 1. Therefore, the computational complexity of this structure is reduced to $C(n) = \log_2 n$, which is much less than the complexity $n - 1$ ...
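One way to realize the recursive split in Equation (2) in software is a doubling scheme: each stage merges two adjacent half-windows with a single comparison per output point, so a window of size n = 2^k needs k stages. The sketch below is an illustration under that assumption, not the filter structure from the cited paper. The function name `running_filter` and its `op` parameter (standing in for the operator T) are hypothetical.

```python
def running_filter(x, n, op=max):
    """Running max/min over windows of size n (a power of two).

    Uses log2(n) stages; each stage performs one comparison per output
    point by combining two adjacent windows of the current span.
    Returns one value per fully covered window, i.e. len(x) - n + 1 values.
    """
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("window size must be a power of two")

    y = list(x)
    span = 1
    while span < n:
        # For i >= span - 1, y[i] currently holds op over x[i-span+1 .. i].
        # Merging with y[i-span] extends the window to 2*span samples.
        y = [y[i] if i < span else op(y[i], y[i - span])
             for i in range(len(y))]
        span *= 2

    return y[n - 1:]             # drop positions whose window is incomplete


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(running_filter(data, 4))            # running max, window of 4
    print(running_filter(data, 4, op=min))    # running min, window of 4
```

The direct approach compares every sample in the window for each output (n - 1 comparisons), whereas this scheme reuses the half-window results, which is the saving the excerpt attributes to the divide-and-conquer strategy.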


Design and FPGA Realization of an Energy Efficient Artificial Neural Modular Exponentiation Architecture

Chapter

Full-text available

Jan 2023

Modular arithmetic computations are used widely in various data security and reliability techniques. Information security systems benefit from the design of energy-efficient modular exponentiation architectures. The use of low-power logic adders to realize modular exponentiation operations is essential in prospective cryptography contexts. In this paper, various full adder circuit designs are presented and used to develop an energy-efficient modular exponentiation architecture. The full adder is designed using Register Transfer Level (RTL), Standard Logic Cell (SLC), Reversible Logic Gate (RLG), and Artificial Neural Network (ANN) logic methods. Each full adder design is applied to the modular exponentiation circuit to analyze performance metrics in terms of dynamic power dissipation, Figure of Merit (FOM), and Energy Delay Product (EDP). The modular exponentiation architecture based on these full adders is simulated and synthesized for a Xilinx Zynq-7000 family configurable device using Vivado. The synthesis results show that the dynamic power dissipation, FOM, and EDP of the ANN modular exponentiation circuit improve on the other designs. Compared to the RLG full adder and modular exponentiation circuit, the ANN full adder and modular exponentiation circuit improve total power consumption, dynamic power dissipation, FOM, and EDP by (8%, 16%), (23.5%, 20.7%), (14.7%, 14%), and (28.5%, 16%), respectively.

An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms

Article

Full-text available

Mar 2022

This paper presents a review of module ring learning with errors-based (MLWE-based) public-key encryption and key-establishment algorithms. In particular, we introduce the preliminaries of public key cryptography, MLWE-based algorithms, and arithmetic operations in post-quantum cryptography. We then focus on analyzing the state-of-the-art hardware architecture designs of CRYSTALS-Kyber at different security levels, including hardware architectures for Kyber-512, Kyber-768, and Kyber-1024. This analysis is dedicated to providing complete guidelines for selecting the most suitable CRYSTALS-Kyber hardware architecture to apply in post-quantum cryptography-based security systems in reality, with different requirements of security levels and hardware efficiency.

Fast and Area Efficient Implementation of RSA Algorithm

Article

Full-text available

Jan 2019

Efficient hardware implementations of public-key cryptosystems have been gaining interest in the past few decades. To that end, a high-frequency, low-latency Rivest-Shamir-Adleman (RSA) cryptosystem is reported in this paper. To configure such a cryptosystem, the shift-add multiplier has been reconstructed and binary-digit-based modular exponentiation circuitry is proposed. This exponentiation circuitry has been implemented through a binary bit distribution technique in which the most significant bit (MSB) is discarded in order to increase the operating frequency. The functionality of the reported algorithms was verified and compared in Hardware Description Language (HDL), simulated in ModelSim, and synthesized on the Xilinx ISE 14.2 platform. The proposed hardware implementation of the RSA algorithm has a maximum operating frequency of 545 MHz and 298 MHz for bit sizes of 8 and 64, respectively. The proposed method shows improvements in speed as well as in the number of look-up tables (LUTs). Moreover, an application-specific integrated circuit (ASIC) implementation of the RSA cryptosystem was carried out using Encounter® RTL Compiler v11.10-p005_1 from Cadence®.

A Low-Latency and Resource-Efficient Scalable RSA Cryptoprocessor Architecture

Preprint

Full-text available

Dec 2017

RSA is one of the well-known cryptographic methods used in asymmetric cryptosystems. However, RSA can still be improved in terms of architecture, performance, power, and resource consumption. In this research, we propose a low-latency and resource-efficient scalable RSA cryptoprocessor architecture to deal with power and resource consumption issues. It is obtained using two approaches: first, optimizing Radix-4 Montgomery multiplication to reduce resource utilization and latency; second, designing a scalable architecture based on the optimized Radix-4 Montgomery multiplication. The proposed design is verified on an FPGA through simulation and an image encryption application. Synthesis results show that the proposed design is low-latency, resource-efficient, and scalable. It requires only 227k cycles of latency and 13k logic gates for 512-bit RSA.

A General Digit-Serial Architecture for Montgomery Modular Multiplication

Article

Feb 2017
IEEE T VLSI SYST

The Montgomery algorithm is a fast modular multiplication method frequently used in cryptographic applications. This paper investigates the digit-serial implementations of the Montgomery algorithm for large integers. A detailed analysis is given and a tight upper bound is presented for the intermediate results obtained during the digit-serial computation. Based on this analysis, an efficient digit-serial Montgomery modular multiplier architecture using carry save adders is proposed and its complexity is presented. In this architecture, pipelined carry select adders are used to perform two final tasks: adding carry save vectors representing the modular product and subtracting the modulus from this addition, if further reduction is needed. The proposed architecture can be designed for any digit size δ and modulus θ. This paper also presents logic formulas for the bits of the precomputation −θ⁻¹ mod 2^δ used in the Montgomery algorithm for δ ≤ 8. Finally, evaluation of the proposed architecture on Virtex 7 FPGAs is presented.

Hybrid Crypto Hardware Utilizing Symmetric-Key & Public-Key Cryptosystems

Conference Paper

Full-text available

Nov 2012

This paper proposes a hybrid crypto system that utilizes the benefits of both symmetric-key and public-key cryptographic methods. Symmetric-key algorithms (DES and AES) are used in the crypto system to perform data encryption. A public-key algorithm (RSA) is used in the crypto system to provide key encryption before key exchange. Combining the symmetric-key and public-key algorithms provides greater security and some unique features that are only possible in this hybrid system. The crypto system design is modeled using Verilog HDL. The implementation has separate modules for DES, AES, and RSA. It also has a pseudorandom number generation unit for random generation of keys and a GCD computation unit for RSA. All the hardware modules are designed by Register Transfer Level (RTL) modeling in Verilog HDL using ModelSim SE 5.7e, showing promising results.

Efficient Reversible Montgomery Multiplier and Its Application to Hardware Cryptography

Article

Jan 2009
J Comput Sci

Nayeem

Problem Statement: The Arithmetic Logic Unit (ALU) of a crypto-processor and microchips leak information through power consumption. Although the cryptographic protocols are secured against mathematical attacks, attackers can break the encryption by measuring the energy consumption. Approach: To thwart attacks, this study proposed the use of reversible logic for designing the ALU of a crypto-processor. Ideally, reversible circuits do not dissipate any energy. If reversible circuits are used, then the attacker would not be able to analyze the power consumption. In order to design the reversible ALU of a crypto-processor, a reversible Carry Save Adder (CSA) using Modified TSG (MTSG) gates and an architecture of the Montgomery multiplier were proposed. For the reversible implementation of the Montgomery multiplier, efficient reversible multiplexers and sequential circuits such as reversible registers and shift registers were presented. Results: This study showed that the modified designs perform better than the existing ones in terms of number of gates, number of garbage outputs, and quantum cost. Lower bounds of the proposed designs were established by providing relevant theorems and lemmas. Conclusion: The application of reversible circuits is suitable to the field of hardware cryptography.

A New Operator for Multi-addition Calculations

Chapter

Full-text available

Jan 2009

Today, multi-operand addition is used in many aspects of computer arithmetic, such as multiplication and exponentiation. One of the best methods for multi-operand addition is the carry-save adder, which has no carry propagation during intermediate summation. This paper introduces a new method that performs like a carry-save adder for multi-operand addition but uses fewer gates. This architecture can reduce the number of logic gates by 40%.

Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems

Article

Full-text available

Jan 2008

Montgomery's algorithm is a popular technique to speed up modular multiplications in public-key cryptosystems. This paper tackles the efficient support of modular exponentiation on inexpensive circuitry for embedded security services and proposes a variant of the finely integrated product scanning (FIPS) algorithm that is targeted to digital signal processors. The general approach improves on the basic FIPS formulation by removing potential inefficiencies and boosts the exploitation of computing resources. The reformulation of the basic FIPS structure results in a general approach that balances computational efficiency and flexibility. Experimental results on commercial DSP platforms confirm both the method's validity and its effectiveness.
