Fig 1 - uploaded by Ekawat Homsirikamol
Three hardware architectures of a hash function: a) basic iterative: x1, b) folded horizontally by a factor of 2: /2(h), c) folded vertically by a factor of 2: /2(v). R: round; S1, S2: selection functions.


Source publication
Conference Paper
In this paper we present a comprehensive comparison of all Round 3 SHA-3 candidates and the current standard SHA-2 from the point of view of hardware performance in modern FPGAs. Each algorithm is implemented using multiple architectures based on the concepts of folding, unrolling, and pipelining. Trade-offs between speed and area are investigated,...

Contexts in source publication

Context 1
... starting point for our exploration of various architectures of hash functions is the basic iterative architecture, shown in Fig. 1a. The characteristic features of this architecture are as follows: a) datapath width = state size (denoted by s), b) one round is performed in a single clock cycle, c) only one message is processed at a time. The minimum block processing time is typically given by ...
Context 2
... a round of a hash function has a symmetric structure, with two or more similar operations performed one after another, horizontal folding is possible. In Fig. 1b, horizontal folding by a factor of two is demonstrated. We will denote this architecture by /2(h). ...
Context 3
... case horizontal folding is either not possible or does not achieve the required reduction in area, vertical folding may be attempted. In Fig. 1c, we demonstrate vertical folding by a factor of 2, which we denote by /2(v). In this architecture, the datapath width is reduced by a factor of two. As a result, two clock cycles are required to complete a round. In the first clock cycle, only bits of the internal state affecting the first half of the round output are provided to the ...
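
The three architectures described in the contexts above can be compared with a small behavioural model. This is only a sketch under stated assumptions: the two-word state, `layer`, and `round_fn` are hypothetical toy functions, not the round of any real hash function, and the model counts clock cycles only (it says nothing about area or clock period).

```python
MASK = 0xFFFFFFFF  # toy 32-bit words; the state is two words (s = 64 bits)

def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def layer(state):
    """One of two identical sub-operations that make up the toy round."""
    a, b = state
    return ((rotl(a, 3) + b) & MASK, rotl(b, 7) ^ a)

def round_fn(state):
    """Toy round R: two similar operations performed one after another."""
    return layer(layer(state))

def basic_iterative(state, rounds):
    """x1: full-width datapath, one round per clock cycle."""
    cycles = 0
    for _ in range(rounds):
        state = round_fn(state)
        cycles += 1
    return state, cycles

def folded_horizontal(state, rounds):
    """/2(h): only one `layer` is instantiated and reused in two cycles."""
    cycles = 0
    for _ in range(rounds):
        for _ in range(2):
            state = layer(state)
            cycles += 1
    return state, cycles

def folded_vertical(state, rounds):
    """/2(v): half-width datapath; each cycle produces one half of the round output."""
    cycles = 0
    for _ in range(rounds):
        nxt = [0, 0]
        for half in range(2):
            # In hardware only the logic for this half would be instantiated;
            # the model simply picks the matching word of the full round output.
            nxt[half] = round_fn(state)[half]
            cycles += 1
        state = tuple(nxt)
    return state, cycles

start = (0x01234567, 0x89ABCDEF)
result_x1 = basic_iterative(start, 10)
result_h = folded_horizontal(start, 10)
result_v = folded_vertical(start, 10)
```

In this model all three variants reach the same final state, while both folded variants take twice as many clock cycles as the basic iterative one.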

Similar publications

Article
This paper presents a hardware-efficient scheme to construct sensor-based Generalized Voronoi Diagram (GVD) of an indoor environment. An architecture to construct the GVD using a prediction and correction strategy is presented. The approach is based on processing distance information from ultrasonic sensors. A feature of the proposed approach...
Article
This article described the Advanced Encryption Standard (AES) encryption and decryption process without using lookup tables in the MixColumns transformation and parallelizing the transformation process implemented in the Field Programmable Gate Array (FPGA) hardware. Parallelism of the hardware process conducted to the transformation of key schedul...
Article
Automatic ECG signal characterization is of critical importance in patient monitoring and diagnosis. This process is computationally intensive, and low-power, online (real-time) solutions to this problem are of great interest. In this paper, we present a novel, dedicated hardware implementation of the ECG signal processing chain based on Hermite fu...
Article
Recently, Compressive Sensing (CS) theory based on the traditional CAMP reconstruction algorithm has been applied in radar systems to achieve the benefits of CS, such as a low sampling rate, small memory size, and lower hardware complexity, and consequently to reduce the required processing time by using a low-speed Analog-to-Digital Converter. A modified recon...
Conference Paper
Convolutional Neural Networks (CNNs) have achieved impressive performance on various computer vision tasks. To facilitate better performance, some complicated-connected CNN models (e.g., GoogLeNet and DenseNet) have recently been proposed, and have achieved state-of-the-art performance in the fields of image classification and segmentation. However...

Citations

... (4) Throughput to Area Ratio: The comparison between performance metrics is recommended according to a fair factor. Taking the ratio of throughput to the area requirements can provide a fair comparison between different kinds of FPGAs [35]. (5) Power Consumption: Reflects the total amount of power consumed by the hardware when applying a specific design. ...
Article
Cryptographic hash functions are widely used primitives whose purpose is to ensure the integrity of data. Hash functions are also utilized in conjunction with digital signatures to provide authentication and non-repudiation services. The Secure Hash Algorithm (SHA) family has been developed over time by the National Institute of Standards and Technology for security, optimal performance, and robustness. The best-known hash standards are SHA-1, SHA-2, and SHA-3. Security is the most notable criterion for evaluating the hash functions. However, the hardware performance of an algorithm serves as a tiebreaker among the contestants when all other parameters (security, software performance, and flexibility) have equal strength. Field Programmable Gate Array (FPGA) is reconfigurable hardware that supports a variety of design options, making it the best choice for implementing the hash standards. In this survey, particular attention is devoted to the FPGA optimization techniques for the three hash standards. The study covers several types of optimization techniques and their contributions to the performance of FPGAs. Moreover, the article highlights the strengths and weaknesses of each of the optimization methods and their influence on performance. We are optimistic that the study will be a useful resource encompassing the efforts carried out on the SHAs and FPGA optimization techniques in a consolidated form.
... Other applications in our top list related to FPGAs are fault detection systems [366][367][368][369][370][371] ... RSA (Rivest-Shamir-Adleman) is one of the first public-key cryptography algorithms, and it is based on the practical difficulty of factoring the product of two large prime numbers [447]. Figure 8 shows that the implementations in FPGAs for SHA-3 started in 2009 with tests and comparisons of SHA-3 candidates [448][449][450][451][452][453]. In October 2012, the Keccak cryptographic function was selected as the winner of the competition [454], and then FPGA implementations of Keccak SHA-3 started [455], such as high throughput/performance [456][457][458][459], IoT-focused applications [457] and compact implementations [460]. ...
Article
Field Programmable Gate Array (FPGA) is a general purpose programmable logic device that can be configured by a customer after manufacturing to perform anything from simple logic gate operations to complex systems on chip or even artificial intelligence systems. Scientific publications related to FPGA started in 1992 and, up to now, we found more than 70,000 documents in the two leading scientific databases (Scopus and Clarivate Web of Science). These publications show the vast range of applications based on FPGAs, from the new mechanism that enables the magnetic suspension system for the kilogram redefinition, to the Mars rovers’ navigation systems. This paper reviews the top FPGAs’ applications by a scientometric analysis in ScientoPy, covering publications related to FPGAs from 1992 to 2018. Here we found the top 150 applications that we divided into the following categories: digital control, communication interfaces, networking, computer security, cryptography techniques, machine learning, digital signal processing, image and video processing, big data, computer algorithms and other applications. Also, we present an evolution and trend analysis of the related applications.
... In Table 1, we provide a snapshot of the high-speed implementations results for FPGAs from different groups. The comprehensive studies on the above selected algorithms are reported by Baldwin et al. [17], Matsuo et al. [18], Gaj et al. [19], Shahid et al. [20] and Homsirikamol et al. [21, 22]. In [17][18][19][20][21][22], the authors have investigated different design strategies and have implemented various architectures for every algorithm using pipelining, folding and loop unrolling approaches. ...
... The comprehensive studies on the above selected algorithms are reported by Baldwin et al. [17], Matsuo et al. [18], Gaj et al. [19], Shahid et al. [20] and Homsirikamol et al. [21, 22]. In [17][18][19][20][21][22], the authors have investigated different design strategies and have implemented various architectures for every algorithm using pipelining, folding and loop unrolling approaches. However, here we have selected only the results of the basic iterative architecture (x1) for Keccak, basic iterative architecture with 4 unrolled stages (x4) for Skein and basic iterative architecture (x1) with memory for JH. ...
Article
In this work, we present a compact hardware implementation of cryptographic hash algorithms [Keccak, Skein & JH] on Field Programmable Gate Array (FPGA) by using an efficient primitive level programming approach. The logic is not only mapped onto Look-Up Tables (LUTs) but also effectively utilizes the FPGA's internal dedicated logic resources, such as Fast Carry Chain logic with MUXCY and XORCY, to reduce overall hardware resources. This approach results in the usage of a minimized chip area with a good balance between resources and speed for the selected hash algorithms. All the implementations have been done on the latest Xilinx FPGAs, and their results are compared in terms of chip area consumption, throughput and throughput per area with previous up-to-date implementations. The results show a substantial improvement as compared to all the previously reported works.
... We obtain 302 slices, while the throughput reaches 287.83 Mbps. As can be seen from Table 1, our work seems to require much less area than most ciphers [17,16,18,15] and also yields a better throughput per area ratio compared to MD5 [17], SHA-1 [16], SHA-256 [18], PHOTON-80/20/16 [2,11] and HMAC-PHOTON-80/20/16 [11]. ...
... We obtain 251 slices, while the throughput reaches 171.42 Mbps. As seen from Table 1, our folding based MAC-PHOTON implementation seems to require much less area than most ciphers [17,16,18,15] and also yields a better throughput per area ratio compared to MD5 [17], SHA-1 [16], SHA-256 [18], PHOTON-80/20/16 [2,11] and HMAC-PHOTON-80/20/16 [11]. ...
... We obtain 1066 slices, while the throughput reaches 508.6 Mbps. As seen from Table 1, our unrolling based MAC-PHOTON implementation seems to require much less area than KECCAK-256 [18,15] and also yields a better throughput per area ratio compared to MD5 [17], SHA-1 [16], SHA-256 [18], PHOTON-80/20/16 [2,11] and HMAC-PHOTON-80/20/16 [11]. ...
Conference Paper
PHOTON is a lightweight hash function which was proposed by Guo et al. at CRYPTO 2011. It is used in low-resource ubiquitous computing devices such as RFID tags, wireless sensor nodes, smart cards and mobile devices. PHOTON is built using the sponge construction and it provides a new MAC function called MAC-PHOTON. This paper deals with FPGA implementations of MAC-PHOTON and their side-channel attack (SCA) resistance. First, we describe three architectures of MAC-PHOTON based on the concepts of iterative, folding and unrolling, and we provide their performance results on the Xilinx Virtex-5 FPGAs. Second, we analyse the security of MAC-PHOTON against side-channel attack using a SASEBO-GII development board. Finally, we present an analysis of its Threshold Implementation (TI) and discuss its resistance against first-order power analysis attacks.
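
The slice counts and throughputs quoted in the three excerpts above can be turned into throughput per area ratios with a few lines of Python. This is a minimal sketch: the label "iterative" for the 302-slice design is an assumption (the excerpt does not name its architecture), and the ratio is expressed in Mbps per slice.

```python
def throughput_per_area(throughput_mbps, slices):
    """Throughput to area ratio in Mbps per slice."""
    return throughput_mbps / slices

# (throughput in Mbps, area in slices), as quoted in the excerpts above
designs = {
    "iterative (assumed label)": (287.83, 302),
    "folded": (171.42, 251),
    "unrolled": (508.6, 1066),
}

ratios = {name: throughput_per_area(t, a) for name, (t, a) in designs.items()}

# Print the designs from best to worst ratio.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.3f} Mbps/slice")
```

With these figures the ratios come out at roughly 0.95, 0.68 and 0.48 Mbps per slice, so the unrolled design is the fastest of the three but the least area-efficient.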
... Table II is the extra area needed when integrating with customized hardware components. The design proposed in [11] is around two times faster than our design, but it has double the area as compared to our design. Our design can support up to four KECCAK algorithms with different parameters, whereas Homsirikamol's designs are only targeted for KECCAK-256 and KECCAK-512. ...
Conference Paper
As embedded systems play more and more important roles in the Internet of Things (IoT), the integration of cryptographic functionalities is an urgent demand to ensure data and information security. Recently, KECCAK was declared as the winner of the third generation of Secure Hashing Algorithm (SHA-3). However, implementing SHA-3 on a specific 32-bit processor failed to meet the performance requirement. On the other hand, implementing it as a cryptographic coprocessor consumes a lot of extra area and requires a customized driver program. Although implementing KECCAK on a 64-bit platform is more efficient, this platform is not suitable for embedded implementation. In this paper, we propose a novel SHA-3 implementation using instruction set extension based on a 32-bit LEON3 processor (an open source processor), with the goals of reducing execution cycles and code size. Experimental results show that the proposed design reduces execution cycles by around 87% and code size by 10.5% as compared to reference designs. Our design takes up only 9.44% extra area with negligible speed overhead compared to the standard LEON3 processor. Compared to the existing hardware accelerators, our proposed design occupies only half of the area resources and does not require extra driver programs to be developed when integrated into the overall system.
... The work in [6] uses an iterative architecture in a Virtex-5 device that achieves a throughput of up to 818 Mbps at 115 MHz.
Design      Frequency (MHz)   Throughput (Mbps)
[6]         115               818
[7]         200               223
[8]         276               80
[9]         -                 1325
Proposed    120               975
...
... The proposed implementations outperform all the previous implementations in terms of time performance. Finally, in [9] a pipelined architecture was implemented, resulting in a high-speed implementation with throughput up to 1325 Mbps. ...
Conference Paper
Skein is a hash function that reached the semifinals of the NIST competition for the selection of standard SHA-3. This paper describes the implementation of Skein-512 operating as simple hash function and as MAC function. The design was coded using VHDL language and for the hardware implementation, two XILINX FPGAs, Virtex-6 and Virtex-7 were used. The proposed implementation reaches a data throughput of 894 Mbps at 110 MHz clock frequency for Virtex-6 and a throughput of 975 Mbps at 120 MHz clock frequency for Virtex-7.
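
The clock frequencies and throughputs reported in this abstract can be cross-checked with the usual relation throughput = block size × clock frequency / cycles per block. The sketch below is an illustration under assumptions: it takes the Skein-512 message block to be 512 bits and ignores any I/O or finalization overhead; the function name is hypothetical.

```python
def implied_cycles_per_block(block_bits, clock_mhz, throughput_mbps):
    """Clock cycles per message block implied by throughput = block_bits * f / cycles."""
    return block_bits * clock_mhz / throughput_mbps

BLOCK_BITS = 512  # assumption: Skein-512 processes 512-bit message blocks

# (clock in MHz, throughput in Mbps), as reported in the abstract above
reported = {
    "Virtex-6": (110, 894),
    "Virtex-7": (120, 975),
}

cycles = {
    device: implied_cycles_per_block(BLOCK_BITS, f, t)
    for device, (f, t) in reported.items()
}

for device, c in cycles.items():
    print(f"{device}: about {c:.1f} cycles per block")
```

Both devices give about 63 cycles per block, which suggests the two results come from the same architecture running at different clock frequencies.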
... required area, throughput, clock frequency, power consumption, energy consumption). These trade-offs can be made by controlling the degree of parallelism [10] [11] [12] [13] [14] [15], using different architectures [16] [17], and using different algorithms [18] [19]. Given this flexibility, we can have multiple hardware implementations for executing a task on an FPGA. ...
... Please cite this article in press as: Marconi T. Online scheduling and placement of hardware tasks with multiple variants on dynamically reconfigurable field-programmable gate arrays. Comput Electr Eng (2013), http://dx.doi.org/10.1016/j.compeleceng.2013.07.004 ... by controlling the degree of parallelism [10] [11] [12] [13] [14] [15], implementing hardware using different architectures [16] [17] and algorithms [18] [19]. Some of the known techniques to control the degree of parallelism are loop unrolling with different unrolling factors [10] [11] [12] [13], recursive variable expansion (RVE) [14], and realizing hardware with different pipeline stages [15]. ...
... For example, our adder outperforms, in terms of latency · area, both classical designs: [38]. The heuristic method "GMU Optimization 1" helped to achieve the best hardware implementation results to date in multiple of our papers, including [39], [21], [40], [33], [28], [36] and [29]. ...
... On October 2, 2012, Keccak [45] was announced as the winner of the NIST hash function competition [46]. This algorithm has demonstrated medium speed in software implementations [47], [48], and the best results in terms of hardware efficiency for both single stream [43] and multiple streams of data in hardware implementations [39]. ...
... An analysis of the influence of the Round 3 tweaks in Grøstl on the performance of this algorithm in FPGAs was conducted in [71]. Comprehensive hardware evaluation across multiple architectures for all SHA-3 finalists, including Grøstl, was investigated in [39] and [81]. The implementation results of hardware architectures, for a single stream of data, in both variants of Grøstl-256 are summarized in Table 2.1. ...
Thesis
Cryptography is a very active branch of science. Due to the everlasting struggle between cryptographers, designing new algorithms, and cryptanalysts, attempting to break them, the cryptographic standards are constantly evolving. In the period 2007-2012, the National Institute of Standards and Technology (NIST) held a competition to select a new cryptographic hash function standard, called SHA-3. The major outcome of this contest, apart from the winner - Keccak, is a strong portfolio of cryptographic hash functions. One of the five SHA-3 finalists, Groestl, was inspired by the Advanced Encryption Standard (AES), and thus can share hardware resources with AES. As a part of this thesis, we have developed a new hardware architecture for a high-speed coprocessor supporting HMAC (Hash Message Authentication Code) based on Groestl and AES in the counter mode. Both algorithms provide efficient hardware acceleration for the authenticated encryption functionality, used in multiple practical security protocols (e.g., IPSec, SSL, and SSH). Our coprocessor outperforms the most competitive design by Jarvinen in terms of the throughput and throughput/area ratio by 133% and 64%, respectively. Pairing-based cryptography has emerged as an important alternative and supplement to traditional public key cryptography. Pairing-based schemes can be used for identity-based encryption, tripartite key exchange protocols, short signatures, identity-based signatures, cryptanalysis, and many other important applications. Compared to other popular public key cryptosystems, such as ECC and RSA, pairing-based schemes are much more computationally intensive. Therefore, hardware acceleration based on modern high-performance FPGAs is an important implementation option. Pairing-schemes over prime fields are considered particularly resistant to cryptanalysis, but at the same time, the most challenging to implement in hardware.
One of the most promising optimization options is taking advantage of embedded resources of modern FPGAs. Practically all FPGA vendors incorporate in modern FPGAs, apart from basic reconfigurable logic blocks, also embedded components, such as DSP units, Fast Carry Chain Adders, and large memory blocks. These hardwired FPGA resources, together with meticulously selected prime numbers, such as Mersenne, Fermat, or Solinas primes, can serve as a basis of an efficient hardware implementation. In this work, we demonstrate a novel high-speed architecture for Tate pairing over prime fields, based on the use of Solinas primes, Fast Carry Chains, and DSP units of modern FPGAs. Our architecture combines Booth recoding, Barrett modular reduction, and the high-radix carry-save representation in the new design for modular multiplication over Solinas primes. Similarly, a low-latency modular adder, based on high-radix carry save addition, Fast Carry Chains, and the Kogge-Stone architecture, has been proposed. The modular multiplier and adder based on the aforementioned principles have been used as basic building blocks for a higher level application - a high-speed hardware accelerator for Tate pairing on twisted supersingular Edwards curves over prime fields. The fastest version of our design calculates Tate pairing at the 80, 120 and 128-bit security level over prime fields in 0.13, 0.54 and 0.70 ms, respectively. It is the fastest pairing implementation over prime fields in the 120-128-bit security range. Apart from properly designed architectures for cryptographic algorithms, one more ingredient contributes to the success of a hardware coprocessor for any application - an electronic design automation software and its set of options.
Concerning this issue, the Cryptographic Engineering Research Group (CERG) at Mason has developed an open-source environment, called ATHENa (Automated Tool for Hardware EvaluatioN), for fair, comprehensive, automated, and collaborative hardware benchmarking and optimization of algorithms implemented in FPGAs. One of the contributions of this thesis is the design of the heart of ATHENa: its most efficient heuristic optimization algorithm, called GMU_Optimization_1. As a basis of its development, multiple comprehensive experiments have been conducted. This algorithm has been demonstrated to provide up to 100% improvement in terms of the throughput to area ratio, when applied to 14 SHA-3 Round 2 candidates. Additionally, our optimization strategy is applicable to the optimization of dedicated hardware in any other area of science and engineering.
... Regarding the hardware implementations of Skein, the works presented in literature can be classified in two main categories. The first category includes the works that perform comparative studies among the candidates of NIST's hash competition [24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41]. The main goal of these works is not to develop sophisticated architectures but to study the performance of these algorithms when they are implemented in hardware. ...
... The straightforward design of the Skein-512 function corresponds to one MIX-permute pair (with no unrolling) plus the appropriate 64-bit adders for the keys' additions. Because the subkeys are inserted every four rounds, another approach is the 4-round unrolled alternative that is followed in many works [27], [29], [36][37][38], [41]. Additionally, because the round constants, R, are different for the first eight rounds and then repeat every eight rounds, the third alternative for unrolling is the 8-round unrolled one, which is also followed by many researchers, as well as the creators of the Skein family [17], [22], [37][38], [41]. ...
... Because the subkeys are inserted every four rounds, another approach is the 4-round unrolled alternative that is followed in many works [27], [29], [36][37][38], [41]. Additionally, because the round constants, R, are different for the first eight rounds and then repeat every eight rounds, the third alternative for unrolling is the 8-round unrolled one, which is also followed by many researchers, as well as the creators of the Skein family [17], [22], [37][38], [41]. For these reasons, all the intermediate alternatives with different unrolling factors are considered non-effective in terms of area, because they demand more steering logic for the sub-keys and the rotation constants, resulting in an increase of the critical path. ...
... Previously, different design methodologies focused on exploring architectures with a high throughput to area ratio, either by reducing area or by increasing throughput. Three different architectures that can be used for high-throughput implementation were discussed in [9]: parallel processing of P and Q, unrolling of both P and Q, and n-stage pipelining. With unrolling, the increase in area is too high compared to the gain in throughput, while pipelining can increase the latency of the design. ...
Article
In 2007 NIST announced a public competition to develop a new cryptographic hash algorithm. This competition was announced because, in recent years, several successful attacks have been reported against SHA-1, which raised significant concerns about SHA-2. This new algorithm will replace SHA-2 and can be used in various security applications in the information infrastructure. This paper focuses on an efficient FPGA implementation of Grøstl, one of the SHA-3 candidates and a Round 3 finalist. The aim of this work is to achieve a high throughput to area ratio (TPA) together with high throughput by considering the trade-off between area and speed. The design is implemented as fully autonomous, with both permutations P and Q executed in parallel, and is equipped with an I/O wrapper. The developed hardware has two designs: the first with the S-box implemented using Look-Up Tables (LUTs) or distributed memory, and the second with the S-box implemented as Block RAM (BRAM). On Virtex-5, the design with the S-box implemented as LUT has a throughput of 9.360 Gbps and occupies 2253 slices including the I/O wrapper, thus achieving a TPA of 4.154 Mbps per slice; the design with the S-box implemented as BRAM has a throughput of 5.565 Gbps and occupies 1356 slices with the wrapper, thus achieving a TPA of 4.104 Mbps per slice.