Fig 1 - uploaded by Tian Sheuan Chang
Architecture of the 512-bit RSA processor. 


Source publication
Article
Full-text available
In this paper, we propose a new algorithm based on Montgomery's algorithm to calculate modular multiplication, which is the core arithmetic operation in an RSA cryptosystem. The modified algorithm eliminates over-large residues and has a very short critical-path delay, which yields very high-speed processing. The new architecture based on this modified...

Contexts in source publication

Context 1
... the last iteration of modular multiplication operation, we post-process R by taking the result and 1 as input operands to remove the extra factor, i.e., = ( ? + , ). We can observe that if the input operand is 1, the higher n bits of the product ( ) will be zero, so the output of postprocessing will be less than the modulus N. Therefore, we not only remove the unwanted factor + from the result but also make the result fall in the right range after postprocessing. ... multiplications is $\lceil \log_2 E \rceil + v(E) - 2$, where $v(E)$ is the number of nonzero bits in E. So, for n-bit RSA modular exponentiation with equal probability for 0 and 1, the number of modular multiplications is $(2n + 2)$ for the worst case and $(1.5n + 2)$ for the average case. Algorithm 2 takes $(2n + 2) \times n$ clock cycles, which is fewer than in [10] and [11], which need $(2n + 2) \times 2n$ cycles to complete a modular exponentiation in the worst case. Since the cycle time is equal to that in [10] and [11], our algorithm takes less time to complete RSA operations and has higher throughput. Fig. 1 shows the architecture of a 512-bit RSA processor based on the modified algorithm. We use four 512-bit linear shift registers to store the operands needed in computing the 512-bit RSA operation ($M^E \bmod N$). The operations of the RSA processor are described below. In the initial stage, RSA operands are loaded into shift registers serially through an 8-bit input buffer. While loading message M into the text register, we shift the exponent register until its first nonzero bit is in the most significant position and count the number of bits of the exponent, $\lceil \log_2 E \rceil$. After the initial stage, we start the multiplier. Once the first output bit of the multiplier is ready, we start the Montgomery module immediately, so the execution times of the CPA, the multiplier, and the Montgomery module overlap almost completely. Therefore, the function units of our design are fully utilized during computation.
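The multiplication count and the "multiply by 1" post-processing step described above can be illustrated with a small functional model. This is a minimal sketch and not the paper's Algorithm 2: `mont_mul` and `mont_exp` are hypothetical names, the Montgomery product is modeled directly with a modular inverse rather than with the bit-level hardware algorithm, the radix R = 2^(n+2) is an assumption, and the term ⌈log₂ E⌉ is read here as the bit length of E.

```python
def mont_mul(a, b, n_mod, r_inv):
    """Functional model of a Montgomery product: a * b * R^-1 mod N."""
    return (a * b * r_inv) % n_mod


def mont_exp(m, e, n_mod, n_bits):
    """Left-to-right binary exponentiation in the Montgomery domain.

    Returns (m**e mod n_mod, number of Montgomery products in the main loop).
    Assumes e >= 1 and n_mod odd.
    """
    r = 1 << (n_bits + 2)          # assumed Montgomery radix R = 2^(n+2)
    r_inv = pow(r, -1, n_mod)      # modular inverse, needs Python 3.8+
    c = (r * r) % n_mod            # precomputed constant R^2 mod N

    m_bar = mont_mul(m, c, n_mod, r_inv)    # pre-processing: m * R mod N
    acc = m_bar                             # the leading 1 bit of e is consumed here
    count = 0
    for bit in bin(e)[3:]:                  # remaining exponent bits, MSB first
        acc = mont_mul(acc, acc, n_mod, r_inv)        # square
        count += 1
        if bit == "1":
            acc = mont_mul(acc, m_bar, n_mod, r_inv)  # multiply
            count += 1
    # Post-processing: a Montgomery product with 1 strips the extra factor R.
    result = mont_mul(acc, 1, n_mod, r_inv)
    return result, count
```

For example, `mont_exp(7, 65537, 3233, 12)` returns the same value as `pow(7, 65537, 3233)` and a loop count equal to `e.bit_length() + bin(e).count("1") - 2`.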
1) Carry-Propagation Adder and Serial-Parallel Multiplier: The carry-propagation adder converts the carry-save form of the output from the Montgomery module to nonredundant binary form. It generates one output bit per cycle to the serial-parallel multiplier for the next iteration. The serial-parallel multiplier shown in Fig. 2 realizes the multiplication and squaring of two (n + 1)-bit numbers. It first generates the n + 2 lower bits of a product serially to the Montgomery module, then it stops and holds the n higher bits of the product. The n higher bits of the product will be added to the output of the Montgomery module to get the modular multiplication ...
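The relationship between carry-save form and nonredundant binary form can be sketched in software. The snippet below is an illustrative model only, not the paper's circuit: `carry_save_add`, `carry_propagate_serial`, and `bits_to_int` are hypothetical helper names, and `width` is assumed to be large enough to hold the final sum.

```python
def carry_save_add(x, y, z):
    """One carry-save adder level: three operands in, (sum, carry) pair out."""
    s = x ^ y ^ z                               # per-bit sum, no carry propagation
    c = ((x & y) | (y & z) | (x & z)) << 1      # per-bit carries, moved to their weight
    return s, c


def carry_propagate_serial(s, c, width):
    """Bit-serial carry-propagation add of s and c.

    Emits one result bit per step, least significant bit first.
    """
    carry = 0
    bits = []
    for i in range(width):
        a = (s >> i) & 1
        b = (c >> i) & 1
        bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return bits


def bits_to_int(bits):
    """Reassemble an LSB-first bit list into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

As a usage example, `carry_save_add(13, 7, 9)` yields a pair whose ordinary sum is 29, and feeding that pair through `carry_propagate_serial(..., width=8)` followed by `bits_to_int` recovers 29 one bit at a time.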
Context 2
... n is the filter length (window size) and T is the max or min operator, respectively. Equation (1) is called "running" max or min filtering because after each output calculation, the filter window is shifted one position to the right (i.e., it "runs"). The computational complexity, measured in number of comparisons per output point, is $C(n) = n - 1$. It is desirable to construct filter structures that have a smaller number of comparisons per output point in order to speed up the filtering process. This is accomplished by employing the "divide-and-conquer" strategy. Let us suppose that the filter window size n is a power of two: $n = 2^k$. It is easily seen that the max or min calculation of n numbers can be split into the max or min calculation of two subsequences of length $n/2$ each: $y_i = T(x_i, \ldots, x_{i-n+1}) = T[\,T(x_i, \ldots, x_{i-(n/2)+1}),\; T(x_{i-(n/2)}, \ldots, x_{i-n+1})\,]$ (2). This procedure can be repeated recursively until we reach subsequences of length 2 [4]. In this case, the max or min calculation of two numbers is done by one comparison only. The corresponding flow diagram is shown in Fig. 1. Therefore, the computational complexity of this structure is reduced to $C(n) = \log_2 n$, which is much less than the complexity $n - 1$ ...
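The recursive split in Equation (2) can be sketched as a cascade of k = log₂ n stages, each performing one two-input comparison per output sample. This is a minimal software illustration for the max operator, assuming n is a power of two; the function names `running_max_naive` and `running_max_dc` are hypothetical and not taken from the cited paper.

```python
def running_max_naive(x, n):
    """Direct form: each output is the max over the last n samples."""
    return [max(x[i - n + 1:i + 1]) for i in range(n - 1, len(x))]


def running_max_dc(x, n):
    """Divide-and-conquer form for n = 2**k.

    Uses k stages; every stage does one two-input max per output sample.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stage = list(x)   # entry j: max over a window of length `span` starting at x[j]
    span = 1
    while span < n:
        # Max over 2*span samples = max of two adjacent windows of length span.
        stage = [max(stage[j], stage[j + span]) for j in range(len(stage) - span)]
        span *= 2
    return stage
```

For `x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]` and `n = 4`, both functions return `[4, 5, 9, 9, 9, 9, 6]`; the second reaches it with two comparisons per output instead of three.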

Citations

... Fast and area efficient modular adder architectures are proposed in [20]. In [21], a novel RSA algorithm is described to develop cryptosystems architectures. In [22], high speed realizations of the RSA cryptographic algorithm use PAM's configurability. ...
Chapter
Full-text available
Modular arithmetic computations are used widely in various data security and reliability techniques. Information security systems certainly benefit from the design of energy-efficient modular exponentiation architectures. The use of low-power logic adders to realize modular exponentiation operations is essential in prospective cryptography contexts. In this paper, various full adder circuit designs are presented which are used in developing an energy-efficient modular exponentiation architecture. Here, the full adder is designed using Register Transfer Level (RTL), Standard Logic Cell (SLC), Reversible Logic Gate (RLG), and Artificial Neural Network (ANN) logic methods. All full adder designs are applied to a modular exponentiation circuit to analyze performance metrics in terms of dynamic power dissipation, Figure of Merit (FOM), and Energy Delay Product (EDP). The modular exponentiation architecture is designed based on the above full adders and is simulated and synthesized using a Xilinx Vivado Zynq-7000 family configurable device. From the synthesis results, the dynamic power dissipation, FOM, and EDP of the ANN modular exponentiation circuit show an improvement compared to the other designs. The total power consumption, dynamic power dissipation, FOM, and EDP of the ANN full adder and modular exponentiation circuit can achieve (8%, 16%), (23.5%, 20.7%), (14.7%, 14%), and (28.5%, 16%) compared to the RLG full adder and modular exponentiation circuit.
... The ciphertext can be decrypted using the private key. One of the most well-known asymmetric cryptographic algorithms was developed by Rivest-Shamir-Adleman (RSA) [1][2][3][4]. The RSA algorithm is the earliest public key cryptographic algorithm developed and published for commercial use. ...
Article
Full-text available
This paper presents a review of module ring learning with errors-based (MLWE-based) public-key encryption and key-establishment algorithms. In particular, we introduce the preliminaries of public key cryptography, MLWE-based algorithms, and arithmetic operations in post-quantum cryptography. We then focus on analyzing the state-of-the-art hardware architecture designs of CRYSTALS-Kyber at different security levels, including hardware architectures for Kyber-512, Kyber-768, and Kyber-1024. This analysis is dedicated to providing complete guidelines for selecting the most suitable CRYSTALS-Kyber hardware architecture to apply in post-quantum cryptography-based security systems in reality, with different requirements of security levels and hardware efficiency.
... In general, cryptosystems can be classified into two groups [1][2][3][4], viz., symmetric key cryptosystem also called private-key cryptosystem and asymmetric key cryptosystem which are usually called the public-key cryptosystem [1][2][3][4][5][6][7][8][9]. Since the advent of asymmetric key cryptography, originally proposed by Diffie and Hellman [2] in 1976, the security of systems and communication became much stronger [4]. ...
... Since the advent of asymmetric key cryptography, originally proposed by Diffie and Hellman [2] in 1976, the security of systems and communication became much stronger [4]. Cryptosystems achieve stronger security by the application of modular arithmetic algorithms, like multiplication and exponentiation [4][5][6][7][8][9], and by suppressing the symmetrical keys. Asymmetric cryptosystem has two different keys for the encryption and decryption algorithm [10,11]. ...
... The work in this paper focuses on the hardware implementation of a widely used asymmetric cryptographic algorithm; Rivest-Shamir-Adleman (RSA) algorithm [5]. To implement the RSA algorithm in hardware, it requires different arithmetic blocks like modular multiplication and exponentiation [5][6][7][8][9][10][11], etc. Various designs have been proposed to achieve the hardware implementation goals (either speed improvement or area reduction). To achieve speed improvement, repetitive modular multiplication and modular exponentiation have been proposed in [13]. ...
Article
Full-text available
Efficient hardware implementations of public-key cryptosystems have been gaining interest in the past few decades. To this end, a high-frequency, low-latency Rivest-Shamir-Adleman (RSA) cryptosystem is reported in this paper. To configure such a cryptosystem, the shift-add multiplier has been reconstructed and binary-digit-based modular exponentiation circuitry is proposed. This exponentiation circuitry has been implemented through a binary bit distribution technique in which the most significant bit (MSB) is discarded in order to increase the operating frequency. The functionality of the reported algorithms was validated and compared in Hardware Description Language (HDL), simulated in Modelsim, and synthesized on the Xilinx ISE 14.2 platform. The proposed hardware implementation of the RSA algorithm has a maximum operating frequency of 545 MHz and 298 MHz for bit sizes of 8 and 64, respectively. The proposed method shows improvements in terms of speed as well as in the number of look-up tables (LUTs). Moreover, an application-specific integrated circuit (ASIC) implementation of this RSA cryptosystem was carried out with the Encounter® RTL Compiler v11.10-p005_1 of the Cadence® tool.
... From the benchmarks, we conclude that the proposed design OptR4 is competitive enough compared to other designs and it achieves an optimal design in respect of low-latency, resource-efficient and scalability for dealing with power and resource consumption issues.
Yang '98 [12]     125   540k    74k    118    0.6 µm    No
Su '99 [13]       100   510k    76k    100    0.6 µm    No
Hsieh '99 [14]    125   6500k   4.5k   10.5   0.6 µm    Yes
Leu '00 [15]      105   270k    78k    200    0.6 µm    No
Yingli '01 [16]   40    180k    96k    113    0.6 µm    Yes
Hong '03 [17]     300   530k    77k    289    0.6 µm    No
Sun '03 [18]      220   405k    40k    276    0.35 µm   Yes
Hong '06 [19]
A low-latency and resource-efficient scalable RSA cryptoprocessor architecture is proposed. It is achieved successfully using two approaches, namely the optimization of Radix-4 Montgomery multiplication and scalable architecture. ...
Preprint
Full-text available
RSA is one of the well-known cryptography methods used in asymmetric cryptosystems. However, RSA's architecture, performance, power, and resource consumption can still be improved. In this research, we propose a low-latency and resource-efficient scalable RSA cryptoprocessor architecture to deal with power and resource consumption issues. It is obtained using two approaches. The first is the optimization of Radix-4 Montgomery multiplication, which reduces resource utilization and latency. The second is the design of a scalable architecture based on the optimized Radix-4 Montgomery multiplication. The proposed design is verified in FPGA through simulation and an image encryption application. Synthesis results show that the proposed design is optimal with respect to latency, resource efficiency, and scalability. It requires only 227k cycles of latency and 13k logic gates for 512-bit RSA.
... In reduction, the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$ is used. The hardware architectures of Montgomery modular multiplication processing the multiplier bits, δ = 1 bit at a time, are abundant in the literature [5]-[14]. In RSA and ECC applications, the modulus θ is an odd integer. ...
... The following algorithm implements the Montgomery multiplication $r = a b\, 2^{-n} \bmod \theta$ given by (4) and (5). ...
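For the bit-serial case (δ = 1), the Montgomery product r = ab·2⁻ⁿ mod θ can be sketched with the textbook radix-2 interleaved loop. This is a standard formulation offered as an illustration, not the algorithm quoted from the cited paper (which is elided above); the function name `mont_mul_bitserial` is hypothetical. With δ = 1 the precomputed value −θ⁻¹ mod 2 is always 1 for odd θ, so the quotient digit is simply the low bit of the running sum.

```python
def mont_mul_bitserial(a, b, theta, n):
    """Radix-2 Montgomery product: returns a * b * 2^(-n) mod theta.

    Assumes theta is odd, theta < 2**n, and 0 <= a, b < theta.
    """
    if theta % 2 == 0:
        raise ValueError("modulus must be odd")
    r = 0
    for i in range(n):
        r += ((a >> i) & 1) * b   # add b when the i-th multiplier bit is 1
        if r & 1:                 # quotient digit: make r even so halving is exact
            r += theta
        r >>= 1                   # exact division by 2
    if r >= theta:                # r < 2*theta here, so one subtraction suffices
        r -= theta
    return r
```

A quick check against Python's built-in modular inverse: `mont_mul_bitserial(123, 200, 239, 8)` equals `(123 * 200 * pow(2, -8, 239)) % 239`.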
Article
The Montgomery algorithm is a fast modular multiplication method frequently used in cryptographic applications. This paper investigates the digit-serial implementations of the Montgomery algorithm for large integers. A detailed analysis is given and a tight upper bound is presented for the intermediate results obtained during the digit-serial computation. Based on this analysis, an efficient digit-serial Montgomery modular multiplier architecture using carry save adders is proposed and its complexity is presented. In this architecture, pipelined carry select adders are used to perform two final tasks: adding carry save vectors representing the modular product and subtracting the modulus from this addition, if further reduction is needed. The proposed architecture can be designed for any digit size δ and modulus θ. This paper also presents logic formulas for the bits of the precomputation $-\theta^{-1} \bmod 2^{\delta}$ used in the Montgomery algorithm for δ ≤ 8. Finally, evaluation of the proposed architecture on Virtex 7 FPGAs is presented.
... Several, performance-optimized implementations [8][9][10][11][12][13][14][15][16][17][18][19][20][21] exist and any of them can be seamlessly incorporated into this proposed hybrid crypto system. ...
Conference Paper
Full-text available
This paper proposes a hybrid crypto system that utilizes the benefits of both symmetric key and public key cryptographic methods. Symmetric key algorithms (DES and AES) are used in the crypto system to perform data encryption. A public key algorithm (RSA) is used in the crypto system to provide key encryption before key exchange. Combining both the symmetric-key and public-key algorithms provides greater security and some unique features which are only possible in this hybrid system. The crypto system design is modeled using Verilog HDL. The implementation has various modules for DES, AES and RSA. The implementation also has a pseudorandom number generation unit for random generation of keys and a GCD computation unit for RSA. All the hardware modules are designed by Register Transfer Level (RTL) modeling in Verilog HDL using ModelSim SE 5.7e, showing promising results.
... Montgomery multiplication is used in modular arithmetic as an efficient way of performing an exponentiation of two numbers modulo a large number, that is, $A^B \bmod N$. Algorithm 2 [18] demonstrates the computation of $A^B \bmod N$ used in RSA encryption and decryption functions. Step 7: Set i := i - 1 ...
... Montgomery multiplication is used in modular arithmetic as an efficient way of performing an exponentiation of two numbers modulo a large number, that is, $A^B \bmod N$. Algorithm 2 [18] demonstrates the computation of $A^B \bmod N$ used in RSA encryption and decryption functions. Inputs: A, B (exponent) = $(1\, b_{k-2}\, b_{k-3} \ldots b_2\, b_1\, b_0)_2$, N (modulus), C (constant) = $2^{(n+2)} \bmod N$. Output: $R = A^B \bmod N$, $0 \le R < N$. MME (A, B, N, C) ...
Article
Full-text available
Problem Statement: Arithmetic Logic Unit (ALU) of a crypto-processor and microchips leak information through power consumption. Although the cryptographic protocols are secured against mathematical attacks, the attackers can break the encryption by measuring the energy consumption. Approach: To thwart attacks, this study proposed the use of reversible logic for designing the ALU of a crypto-processor. Ideally, reversible circuits do not dissipate any energy. If reversible circuits are used, then the attacker would not be able to analyze the power consumption. In order to design the reversible ALU of a crypto-processor, reversible Carry Save Adder (CSA) using Modified TSG (MTSG) gates and architecture of Montgomery multiplier were proposed. For reversible implementation of Montgomery multiplier, efficient reversible multiplexers and sequential circuits such as reversible registers and shift registers were presented. Results: This study showed that modified designs perform better than the existing ones in terms of number of gates, number of garbage outputs and quantum cost. Lower bounds of the proposed designs were established by providing relevant theorems and lemmas. Conclusion: The application of reversible circuit is suitable to the field of hardware cryptography.
... Exponentiation and multiplication of large integers are the basis of several well-known cryptographic algorithms such as RSA [1], Elliptic Curve cryptography (ECC) [2,3], NTRU [4,5], etc. As a result, methods which speed up implementations of multiplication and exponentiation are of considerable practical significance [6,7,8,9,10,11]. Methods such as Montgomery [6], RNS [12,13,14], and the use of redundant numbers are some examples of such efforts [15]. ...
Chapter
Full-text available
Today, multi-operand addition is used in many aspects of computer arithmetic, such as multiplication and exponentiation. One of the best methods for multi-operand addition is the carry-save adder, which has no carry propagation during intermediate summation. This paper introduces a new method that performs like a carry-save adder for multi-operand addition but uses fewer gates. This architecture can reduce the number of logic gates by 40%.
... Several approaches have been proposed in the literature for the implementation of Montgomery's multiplication [2]. In view of the expanding demand of security services on embedded machinery, great efforts have been recently devoted to developing efficient implementations of that algorithm on FPGAs, DSPs, and microcontrollers [4][5][6][7][8][9][10][11][12][13][14][15]. ...
Article
Full-text available
Montgomery's algorithm is a popular technique to speed up modular multiplications in public-key cryptosystems. This paper tackles the efficient support of modular exponentiation on inexpensive circuitry for embedded security services and proposes a variant of the finely integrated product scanning (FIPS) algorithm that is targeted to digital signal processors. The general approach improves on the basic FIPS formulation by removing potential inefficiencies and boosts the exploitation of computing resources. The reformulation of the basic FIPS structure results in a general approach that balances computational efficiency and flexibility. Experimental results on commercial DSP platforms confirm both the method's validity and its effectiveness.