Figure 3 - uploaded by Martin Simka
Multiplier Stage with Carry-Propagate Adders and Non-redundant Representation

Source publication
Conference Paper
Full-text available
The security of the most popular asymmetric cryptographic scheme RSA depends on the hardness of factoring large numbers. The best known method for factoring large integers is the general number field sieve (GNFS). Recently, architectures for special purpose hardware for the GNFS have been proposed. One important step within the GNFS is the fact...

Context in source publication

Context 1
... internal word additions are performed by simple adders. Figure 3 shows the architecture of one stage. While in [13] a structure with carry-save adders and redundant representation of operands has been implemented, we have chosen a configuration with carry-propagate adders and non-redundant representation, which allows a more effective implementation, especially when the target platform supports fast carry chain logic. ...
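The difference between the two adder styles named in this context can be illustrated with a minimal Python sketch. It is not the architecture of Figure 3; the function names `carry_save_add` and `carry_propagate_add` and the word width are assumptions chosen for illustration. A carry-save adder compresses three operands into a redundant (sum, carry) pair without rippling carries, whereas a carry-propagate adder returns a single non-redundant word plus a carry-out.

```python
def carry_save_add(x, y, z):
    """3:2 compression: three operands in, redundant (sum, carry) pair out.

    No carry ripples between bit positions; the true total is sum + carry.
    """
    partial_sum = x ^ y ^ z                      # bitwise sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, shifted left
    return partial_sum, carry


def carry_propagate_add(x, y, width):
    """Ordinary adder: one non-redundant `width`-bit word plus a carry-out bit."""
    total = x + y
    return total & ((1 << width) - 1), total >> width


# The redundant pair still has to be resolved by a carry-propagate addition.
s, c = carry_save_add(13, 7, 9)
word, carry_out = carry_propagate_add(s, c, 16)
print(word, carry_out)
```

The redundant form defers carry resolution to the end of a computation, while the non-redundant form resolves carries at every step, which is cheap when the target device provides dedicated carry-chain logic.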

Similar publications

Article
Full-text available
In today’s developing era, data and information security plays an important role in unsecured communication between Internet of Things (IoT) elements. In IoT, data are transmitted in plaintext for many reasons. One of the most common reasons is the availability of hardware. Many IoT products are inexpensive components with limited memory and computation...
Conference Paper
Full-text available
We present an FPGA-based accelerator for elliptic curve cryptography on a Koblitz curve targeting applications requiring very high speed. The accelerator supports fast computation of point multiplication by using window methods as well as multiple point multiplications with joint sparse form representations. Optimized operation-specific process...
Article
Full-text available
Montgomery Multiplication is a common and important algorithm for improving the efficiency of public key cryptographic algorithms, like RSA and Elliptic Curve Cryptography (ECC). A natural choice for implementing this time-consuming multiplication defined on finite fields, mainly over GF(2^m), is the use of Field Programmable Gate Arrays (FPGA...
Conference Paper
Full-text available
Elliptic Curve Cryptography (ECC) is becoming unavoidable, and should be used for public key protocols. It has gained increasing acceptance in practice due to the significantly smaller bit size of the operands compared to RSA for the same security level. Most protocols based on ECC imply the computation of a scalar multiplication. ECC can be perfor...
Conference Paper
Full-text available
This paper discusses architectures for embedded security to enable various cryptographic services at low cost. To realize the large bit-lengths and complex arithmetic on an 8-bit embedded micro-controller, several hardware acceleration options for elliptic and hyperelliptic curve cryptography (ECC and HECC) are studied and systematically evaluated....

Citations

... The paper showed that a single 2002-era FPGA could recover a 40b key in 50 hours and that the approach could be trivially parallelized onto multiple FPGAs. Highly efficient FPGA hardware implementations of sieve operations that factor large numbers were developed [223]. Subsequent work attacking ECC [224] showed that 160b ECC keys were probably still safe, but 112b ECC keys were potentially vulnerable. ...
Article
Full-text available
Reconfigurable architectures can bring unique capabilities to computational tasks. They offer the performance and energy efficiency of hardware with the flexibility of software. In some domains, they are the only way to achieve the required, real-time performance without fabricating custom integrated circuits. Their functionality can be upgraded and repaired during their operational lifecycle and specialized to the particular instance of a task. We survey the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
... In this thesis we see ECM at work in the context of "cofactorization" [111], [110], a step of the NFS (see 2.6.4) to factor relatively small auxiliary positive integers. Several works have explored this application of the algorithm [62,75,191,89,133,166,214,39]. However, ECM also has two applications in the context of large integer factorization. ...
... Most previous work in this direction focussed on offloading the elliptic curve integer factoring (ECM, [130]), which is only part of this follow-up stage. For graphics processing units (GPUs) this is considered in [19,17,39] and for reconfigurable hardware such as field-programmable gate arrays in [191,166,75,62,89,133,214]. ...
... Related work on stage 1 of ECM for cofactoring on constrained devices can be found in [191,166,75,62,89,133,214,19,17,39]. Unlike these publications, the GPU-implementation presented here includes stage 2 of ECM, as it significantly improves the performance of ECM. ...
Article
The RSA cryptosystem introduced in 1977 by Ron Rivest, Adi Shamir and Len Adleman is the most commonly deployed public-key cryptosystem. Elliptic curve cryptography (ECC) introduced in the mid-80s by Neal Koblitz and Victor Miller is becoming an increasingly popular alternative to RSA offering competitive performance due to the use of smaller key sizes. Most recently hyperelliptic curve cryptography (HECC) has been demonstrated to have comparable and in some cases better performance than ECC. The security of RSA relies on the integer factorization problem whereas the security of (H)ECC is based on the (hyper)elliptic curve discrete logarithm problem ((H)ECDLP). In this thesis the practical performance of the best methods to solve these problems is analyzed and a method to generate secure ephemeral ECC parameters is presented. The best publicly known algorithm to solve the integer factorization problem is the number field sieve (NFS). Its most time-consuming step is the relation collection step. We investigate the use of graphics processing units (GPUs) as accelerators for this step. In this context, methods to efficiently implement modular arithmetic and several factoring algorithms on GPUs are presented and their performance is analyzed in practice. In conclusion, it is shown that integrating state-of-the-art NFS software packages with our GPU software can lead to a speed-up of 50%. In the case of elliptic and hyperelliptic curves for cryptographic use, the best published method to solve the (H)ECDLP is the Pollard rho algorithm. This method can be made faster using classes of equivalence induced by curve automorphisms like the negation map. We present a practical analysis of their use to speed up Pollard rho for elliptic curves and genus 2 hyperelliptic curves defined over prime fields. As a case study, 4 curves at the 128-bit theoretical security level are analyzed in our software framework for Pollard rho to estimate their practical security level.
In addition, we present a novel many-core architecture to solve the ECDLP using the Pollard rho algorithm with the negation map on FPGAs. This architecture is used to estimate the cost of solving the Certicom ECCp-131 challenge with a cluster of FPGAs. Our design achieves a speed-up factor of about 4 compared to the state-of-the-art. Finally, we present an efficient method to generate unique, secure and unpredictable ephemeral ECC parameters to be shared by a pair of authenticated users for a single communication. It provides an alternative to the customary use of fixed ECC parameters obtained from publicly available standards designed by untrusted third parties. The effectiveness of our method is demonstrated with a portable implementation for regular PCs and Android smartphones. On a Samsung Galaxy S4 smartphone our implementation generates unique 128-bit secure ECC parameters in 50 milliseconds on average.
... Most previous work in this direction focussed on offloading the elliptic curve integer factoring (ECM, [31]), which is only part of this follow-up stage. For graphics processing units (GPUs) this is considered in [7,5,10] and for reconfigurable hardware such as field-programmable gate arrays in [53,45,17,14,19,32,58]. To allow the CPUs to keep sieving, thus optimally using their memory, in this paper the possibility is explored to offload the entire follow-up stage to GPUs. ...
... Related work on stage 1 of ECM for cofactoring on constrained devices can be found in [53,45,17,14,19,32,58,7,5,10]. Unlike these publications, the GPU-implementation presented here includes stage 2 of ECM, as it significantly improves the performance of ECM. ...
Conference Paper
Full-text available
We show how the cofactorization step, a compute-intensive part of the relation collection phase of the number field sieve (NFS), can be farmed out to a graphics processing unit. Our implementation on a GTX 580 GPU, which is integrated with a state-of-the-art NFS implementation, can serve as a cryptanalytic co-processor for several Intel i7-3770K quad-core CPUs simultaneously. This allows those processors to focus on the memory-intensive sieving and results in more useful NFS-relations found in less time.
... Using ECM as a tool to factor many small numbers inside NFS is an active research area by itself. Offloading this work to reconfigurable hardware such as field-programmable gate arrays is studied in [37,16,11,17,25,40] while [5,4] considers parallel architectures such as graphics processing units (GPUs) and the Cell broadband engine architecture. A comparison between software and hardware based solutions is presented in [21]. ...
Conference Paper
Full-text available
The performance of the elliptic curve method (ECM) for integer factorization plays an important role in the security assessment of RSA-based protocols as a cofactorization tool inside the number field sieve. The efficient arithmetic for Edwards curves found an application by speeding up ECM. We propose techniques based on generating and combining addition-subtraction chains to optimize Edwards ECM in terms of both performance and memory requirements. This makes our approach very suitable for memory-constrained devices such as graphics processing units (GPU). For commonly used ECM parameters we are able to lower the required memory by up to a factor of 55 compared to the state-of-the-art Edwards ECM approach. Our ECM implementation on a GTX 580 GPU sets a new throughput record, outperforming the best GPU, CPU and FPGA results reported in the literature.
... The parameters for stage 1 (stage 2) in ECM varied depending on the composite size and ranged from 150 (9 000) to 500 (36 000) where often only a single curve was tried with a maximum of around eight curves. This area, using ECM for cofactorization, has seen a flurry of recent activity: see [68,84,95,138,160,186,208] for implementations of ECM targeted at small integers on reconfigurable hardware such as field-programmable gate arrays and [17,18] for GPUs. In [17] the Cell architecture is covered as well. ...
Article
Nowadays, the most popular public-key cryptosystems are based on either the integer factorization or the discrete logarithm problem. The feasibility of solving these mathematical problems in practice is studied and techniques are presented to speed up the underlying arithmetic on parallel architectures. The fastest known approach to solve the discrete logarithm problem in groups of elliptic curves over finite fields is the Pollard rho method. The negation map can be used to speed up this calculation by a factor √2. It is well known that the random walks used by Pollard rho when combined with the negation map get trapped in fruitless cycles. We show that previously published approaches to deal with this problem are plagued by recurring cycles, and we propose effective alternative countermeasures. Furthermore, fast modular arithmetic is introduced which can take advantage of prime moduli of a special form using efficient "sloppy reduction." The effectiveness of these techniques is demonstrated by solving a 112-bit elliptic curve discrete logarithm problem using a cluster of PlayStation 3 game consoles: breaking a public-key standard and setting a new world record. The elliptic curve method (ECM) for integer factorization is the asymptotically fastest method to find relatively small factors of large integers. From a cryptanalytic point of view the performance of ECM gives information about secure parameter choices of some cryptographic protocols. We optimize ECM by proposing carry-free arithmetic modulo Mersenne numbers (numbers of the form 2^M - 1) especially suitable for parallel architectures. Our implementation of these techniques on a cluster of PlayStation 3 game consoles set a new record by finding a 241-bit prime factor of 2^1181 - 1. A normal form for elliptic curves introduced by Edwards results in the fastest elliptic curve arithmetic in practice.
Techniques to reduce the temporary storage and enhance the performance even further in the setting of ECM are presented. Our results enable one to run ECM efficiently on resource-constrained platforms such as graphics processing units.
... In general, larger values of B1 and B2 increase the probability of success in Phase 1 and Phase 2 respectively (and thus decrease the expected number of curves), but at the same time increase the execution time per curve of these phases. A theoretical analysis of the optimal parameter choices is given in [26], with a view towards software implementations. The techniques developed there, which use Dickman's function to estimate the probability of success of the Elliptic Curve Method, can be adapted to a hardware setting and make it possible to determine optimal parameter choices via numerical approximations to Dickman's function. ...
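As a rough illustration of what a numerical approximation to Dickman's function can look like, the sketch below tabulates ρ(u) on a uniform grid. It is not the method of [26]; the function name `dickman_rho_table` and the grid resolution are assumptions. It relies on the standard facts that ρ(u) = 1 for 0 ≤ u ≤ 1 and that u·ρ'(u) = −ρ(u − 1) for u > 1, integrated here with the trapezoidal rule.

```python
import math


def dickman_rho_table(u_max, steps_per_unit=1000):
    """Tabulate Dickman's rho on [0, u_max] with step 1/steps_per_unit.

    Entry i approximates rho(i / steps_per_unit).
    """
    n = steps_per_unit
    h = 1.0 / n
    size = int(u_max * n) + 1
    rho = [1.0] * size                     # rho(u) = 1 on [0, 1]
    for i in range(n + 1, size):
        u_prev = (i - 1) * h
        u_curr = i * h
        # rho'(u) = -rho(u - 1) / u; both delayed values are already known.
        slope_prev = rho[i - 1 - n] / u_prev
        slope_curr = rho[i - n] / u_curr
        rho[i] = rho[i - 1] - 0.5 * h * (slope_prev + slope_curr)
    return rho


table = dickman_rho_table(4)
# rho(2) has the closed form 1 - ln 2, which gives a quick sanity check.
print(table[2000], 1 - math.log(2))
```

Heuristically, ρ(u) approximates the probability that a random integer near x has no prime factor larger than x^(1/u), which is why it appears in success-probability estimates for smoothness-based methods.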
Article
Full-text available
A novel portable hardware architecture of the Elliptic Curve Method of factoring, designed and optimized for application in the relation collection step of the Number Field Sieve, is described and analyzed. A comparison with an earlier proof-of-concept design by Pelzl et al. has been performed, and a substantial improvement has been demonstrated in terms of both the execution time and the area-time product. The ECM architecture has been ported across five different families of FPGA devices in order to select the family with the best performance to cost ratio. A timing comparison with the highly optimized software implementation, GMP-ECM, has been performed. Our results indicate that low-cost families of FPGAs, such as Spartan-3 and Spartan-3E, offer at least an order of magnitude improvement over the same generation of microprocessors in terms of the performance to cost ratio, without the use of embedded FPGA resources, such as embedded multipliers.
... That task requires little memory and is therefore best outsourced to cheap devices, so sieving is not interrupted and all resources are used in a cost-conscious fashion. This area has seen a flurry of recent activity: see [49], [45], [24], [22] for implementations on reconfigurable hardware such as field-programmable gate arrays and [7], [6] for GPUs. In [6] the Cell architecture is covered as well. ...
Article
Full-text available
This paper describes carry-less arithmetic operations modulo an integer 2^M-1 in the thousand-bit range, targeted at single instruction multiple data platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles a new record was set for the elliptic curve method for integer factorization.
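The reason a modulus of the form 2^M - 1 is attractive can be shown with a minimal Python sketch. It illustrates only the underlying identity 2^M ≡ 1 (mod 2^M - 1), not the carry-less SIMD representation the paper describes; the function names `reduce_mod_mersenne` and `mul_mod_mersenne` are assumptions. Reduction needs no division: the bits above position M are folded back onto the low M bits by addition.

```python
def reduce_mod_mersenne(x, m_bits):
    """Reduce a non-negative integer x modulo 2**m_bits - 1 by folding."""
    mask = (1 << m_bits) - 1
    while x > mask:
        # x = high * 2**m_bits + low, and 2**m_bits is congruent to 1.
        x = (x & mask) + (x >> m_bits)
    return 0 if x == mask else x           # the modulus itself represents zero


def mul_mod_mersenne(a, b, m_bits):
    """Multiply two residues modulo 2**m_bits - 1."""
    return reduce_mod_mersenne(a * b, m_bits)


m_bits = 1181
a, b = 3 ** 700, 5 ** 500
print(mul_mod_mersenne(a, b, m_bits) == (a * b) % ((1 << m_bits) - 1))
```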
... An important step in the GNFS algorithm is the factorization of mid-sized numbers for smoothness testing. For this purpose, the ECM has been proposed by Lenstra [30], which has been proved to be suitable for parallel hardware architectures in [9], [15], and [44], particularly on FPGAs. The ECM algorithm performs a very high number of operations on a very small set of input data and is not demanding in terms of high communication bandwidth. ...
Article
Full-text available
Cryptanalysis of ciphers usually involves massive computations. The security parameters of cryptographic algorithms are commonly chosen so that attacks are infeasible with available computing resources. This contribution presents a variety of cryptanalytical applications utilizing the COPACOBANA (Cost-Optimized Parallel Code Breaker) machine which is a high-performance, low-cost cluster consisting of 120 Field Programmable Gate Arrays (FPGA). COPACOBANA appears to be the only such reconfigurable parallel FPGA machine optimized for code breaking tasks reported in the open literature. Depending on the actual algorithm, the parallel hardware architecture can outperform conventional computers by several orders of magnitude. In this work, we will focus on novel implementations of cryptanalytical algorithms, utilizing the impressive computational power of COPACOBANA. We describe various exhaustive key search attacks on symmetric ciphers and demonstrate an attack on a security mechanism employed in the electronic passport. Furthermore, we describe time-memory tradeoff techniques which can, e.g., be used for attacking the popular A5/1 algorithm used in GSM voice encryption. In addition, we introduce efficient implementations of more complex cryptanalysis on asymmetric cryptosystems, e.g., Elliptic Curve Cryptosystems (ECC) and number co-factorization for RSA.
... On the other hand, Franke et al. proposed a sophisticated design, SHARK, using butterfly sorting [FKP+05]. In order to accelerate the sieving step, FPGA implementations of the mini-factoring were discussed in [FKP+05,SPK+05,GKB+06]. In spite of these theoretical efforts, no implementation results of the whole sieving part on ASIC or FPGA have been known up to the present. One of the reasons may be that designing and manufacturing such dedicated devices requires a large amount of money and time. ...
Conference Paper
Full-text available
The hardness of the integer factorization problem assures the security of some public-key cryptosystems including RSA, and the number field sieve method (NFS), currently the most efficient algorithm for factoring large integers, is a threat to such cryptosystems. Recently, dedicated factoring devices have attracted much attention since they might reduce the computing cost of the number field sieve method. In this paper, we report implementation and experimental results of a dedicated sieving device “CAIRN 2” with Xilinx’s FPGA which is designed to handle up to 768-bit integers. The algorithm used is based on line sieving; however, in order to optimize the efficiency, we adapted a new implementation method (the pipelined sieving). In addition, we actually factored a 423-bit integer in about 30 days with the developed device CAIRN 2 for the sieving step and usual PCs for other steps. As far as the authors know, this is the first FPGA implementation and experiment of the sieving step in NFS.
... The cofactoring stage can be divided into a brute force trial division step to identify small primes (less than 100,000) followed by more complex probabilistic methods such as the rho method, p-1 method, or elliptic curve method (ECM) [5]. Efficient hardware implementation of these methods is a subject of extensive research [5,6,7,8,9,10,11]. Less well studied are techniques to develop an efficient hardware implementation to perform the trial division by small primes. ...
Article
Full-text available
Trial division is the most straightforward way to determine the prime factors of a number, but the execution time is exponentially dependent on the size of the number. We have developed a novel hardware architecture which performs trial division of large dividends by small prime divisors at a much higher throughput than previously reported architectures. Our design is implemented in FPGA devices and provides a speed-up of between one and two orders of magnitude over an optimized software implementation of the same algorithm. These results can be employed to speed up factoring algorithms like the quadratic sieve or number field sieve when implemented in reconfigurable computers.
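For orientation, trial division of a large dividend by small prime divisors can be sketched in software as follows. This is a plain reference version, not the hardware architecture of the abstract; the function names `small_primes` and `trial_divide` and the bound of 100,000 are assumptions made for the example.

```python
import math


def small_primes(limit):
    """Return all primes below `limit` using the sieve of Eratosthenes."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"               # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            multiples = range(p * p, limit, p)
            sieve[p * p::p] = bytes(len(multiples))
    return [i for i, flag in enumerate(sieve) if flag]


def trial_divide(n, primes):
    """Strip every factor of n found in `primes`.

    Returns the list of small prime factors (with multiplicity) and the
    remaining cofactor, which a later method would have to handle.
    """
    factors = []
    for p in primes:
        while n % p == 0:
            factors.append(p)
            n //= p
        if n == 1:
            break
    return factors, n


primes = small_primes(100_000)
factors, cofactor = trial_divide(2**3 * 3 * 101 * (2**61 - 1), primes)
print(factors, cofactor)
```

Each candidate divisor costs one remainder operation on the full-size dividend, which is the operation a dedicated design would aim to perform at high throughput.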