Article

Fast hashing on Pentium SIMD architecture


Abstract

Printout. Thesis (M.S.)--Oregon State University, 2005. Includes bibliographical references (leaves 37-38).

... Simultaneous multi-message hashing (data parallelism) can be utilized to improve efficiency for these structures. Data-parallel hashing can achieve a high degree of parallelism [10]. There is some other work to speed up single-message hashing. ...
... SHA-256 parallel acceleration attempts in the early years were mostly focused on CPU platforms. The first SIMD-based implementation of hash algorithms was proposed in 2004 [10]. Parallel optimization was tried using two methods: 1) Parallelize some computations of the message schedule when hashing a single message. ...
... The originally defined Boolean function is not optimal. We use the redefined Boolean functions from [10] to reduce the number of instructions. Table 2 shows the original and new definitions of the two Boolean functions. ...
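The idea of redefining a Boolean function to save instructions can be illustrated with a short Python check. Table 2 is not reproduced in this excerpt, so the shortened forms below are the widely used equivalents of the SHA `Ch` and `Maj` functions and are an assumption, not necessarily the exact definitions from [10]. The names `ch_spec`, `ch_short`, `maj_spec`, `maj_short`, and `equivalent` are hypothetical.

```python
from itertools import product

MASK = 0xFFFFFFFF  # SHA-1 and SHA-256 operate on 32-bit words


def ch_spec(x, y, z):
    # Standard definition: two ANDs, one NOT, one XOR (4 operations).
    return ((x & y) ^ (~x & z)) & MASK


def ch_short(x, y, z):
    # Assumed shortened form: two XORs and one AND (3 operations).
    return z ^ (x & (y ^ z))


def maj_spec(x, y, z):
    # Standard definition: three ANDs and two XORs (5 operations).
    return (x & y) ^ (x & z) ^ (y & z)


def maj_short(x, y, z):
    # Assumed shortened form: two ANDs and two ORs (4 operations).
    return (x & y) | (z & (x | y))


def equivalent(f, g):
    # Both functions act bit by bit, so checking all 8 single-bit
    # input combinations proves equality for words of any width.
    return all(
        (f(x, y, z) & 1) == (g(x, y, z) & 1)
        for x, y, z in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(equivalent(ch_spec, ch_short), equivalent(maj_spec, maj_short))
```

Because each form is applied once per round, saving one operation per call removes one instruction from every round that uses the function.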
Article
To explore whether new parallelism techniques can provide additional performance improvements in cryptographic hash functions, we conducted our study with the SW26010, a special-architecture processor in Sunway TaihuLight, one of the world's fastest supercomputers. Secure Hash Algorithms (SHAs) are significant for secure transmission, with SHA-256 remaining a secure and among the most efficient SHA designs. We propose SW-SHA-256, a parallel SHA-256 implementation for hashing multiple messages on the SW26010. Our work explores parallel schemes at the instruction and thread levels. At the instruction level, we use vector registers to load multiple messages and hash them simultaneously. Assembly-level optimization methods such as dual issue are used, and the pipeline is distinct from that of a general-purpose processor. At the thread level, an optimized DMA transmission strategy and a double-buffer technique are used to reduce the cost of moving data from memory to cache. As a result, we obtain 5.87 cycles per byte on a single core, an 8.18x speedup over the C code in OpenSSL v3.0.0. Moreover, our implementation achieves a throughput of 60.21 GB/s on a SW26010 processor and is highly scalable.
... This task is also known in the literature as multi-buffer hashing, simultaneous hashing, or n-way hashing [1,13,17]. This workload can be efficiently performed using parallel computing techniques, such as the multi-core processing, the SIMD vectorization, and the pipelining instruction scheduling. ...
... For instance, Gueron [17] showed the n-SMS technique, which parallelizes the message schedule phase and leads to faster single-message hashing. Following the work of Aciicmez [1], Gueron and Krasnov [18] reported an implementation that uses the SSE unit to compute four SHA-256 digests simultaneously; as a result, their implementation runs 2.2× faster than the n-SMS single-message implementation of Gueron [17]. They also extended the parallelization to 256-bit registers; although AVX2 was not available when their work was published, their estimates accurately matched the performance of the AVX2-ready processors available today. ...
Conference Paper
The latest processors have included extensions to the instruction set architecture tailored to speed up the execution of cryptographic algorithms. Like the AES New Instructions (AES-NI) that target the AES encryption algorithm, the release of the SHA New Instructions (SHA-NI), designed to support the SHA-256 hash function, introduces a new scenario for optimizing cryptographic software. In this work, we present a performance evaluation of several cryptographic algorithms, hash-based signatures and data encryption, on platforms that support AES-NI and/or SHA-NI. In particular, we revisited several optimization techniques targeting multiple-message hashing, and as a result, we reduce by 21% the running time of this task by means of a pipelined SHA-NI implementation. In public-key cryptography, multiple-message hashing is one of the critical operations of the XMSS and XMSS^MT post-quantum hash-based digital signatures. Using SHA-NI extensions, signatures are computed 4x faster; however, our pipelined SHA-NI implementation increased this speedup factor to 4.3x. For symmetric cryptography, we revisited the implementation of AES modes of operation and reduced by 12% and 7% the running time of CBC decryption and CTR encryption, respectively.
... A SIMD-based implementation of hash algorithms was first proposed (in 2004) and described in detail by Aciiçmez [1]. He studied the computations of SHA-1, SHA-256 and SHA-512, and his investigation was carried out on the Intel® Pentium™ 4, using SSE2 instructions. ...
... For the experiments, we prepared two directories with a different combination of files. The first directory (DIVERSE hereafter) contained 350 files occupying 79MB (82,833,132 bytes) in total. The file sizes range from 3 bytes to 7.18MB (7,533,568 bytes), with an average size of 0.22MB (236,666 bytes). ...
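The multi-message approach described in these excerpts can be sketched in plain Python, with a list of four integers standing in for one 128-bit SSE2 register holding four 32-bit lanes. This is an illustrative model only, not the thesis's code: the helper names (`vadd`, `vrotl`, `vxor`, `pad`, `sha1_x4`) are hypothetical, and the sketch assumes every message fits in a single 64-byte block. Each lane-wise helper corresponds to what one packed SIMD instruction, or a short sequence of them, would do.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
LANES = 4  # four 32-bit lanes, as in one 128-bit SSE2 register


def vadd(*vs):
    """Lane-wise addition modulo 2**32."""
    return [sum(lane) & MASK for lane in zip(*vs)]


def vrotl(v, n):
    """Lane-wise left rotation by n bits."""
    return [((x << n) | (x >> (32 - n))) & MASK for x in v]


def vxor(*vs):
    """Lane-wise XOR of any number of vectors."""
    out = [0] * LANES
    for v in vs:
        out = [a ^ b for a, b in zip(out, v)]
    return out


def pad(msg):
    """SHA-1 padding; this sketch only accepts single-block messages."""
    assert len(msg) <= 55
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))


def sha1_x4(messages):
    """Hash four short messages at once, one message per lane."""
    assert len(messages) == LANES
    blocks = [struct.unpack(">16I", pad(m)) for m in messages]
    # Transpose: w[t] holds word t of every message, one per lane.
    w = [list(col) for col in zip(*blocks)]
    for t in range(16, 80):
        w.append(vrotl(vxor(w[t - 3], w[t - 8], w[t - 14], w[t - 16]), 1))
    h = [[c] * LANES for c in
         (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)]
    a, b, c, d, e = h
    for t in range(80):
        if t < 20:    # Ch, written in a three-operation form
            f = [z ^ (x & (y ^ z)) for x, y, z in zip(b, c, d)]
            k = 0x5A827999
        elif t < 40:  # Parity
            f, k = vxor(b, c, d), 0x6ED9EBA1
        elif t < 60:  # Maj, written in a four-operation form
            f = [(x & y) | (z & (x | y)) for x, y, z in zip(b, c, d)]
            k = 0x8F1BBCDC
        else:         # Parity
            f, k = vxor(b, c, d), 0xCA62C1D6
        tmp = vadd(vrotl(a, 5), f, e, [k] * LANES, w[t])
        a, b, c, d, e = tmp, a, vrotl(b, 30), c, d
    state = [vadd(x, y) for x, y in zip(h, (a, b, c, d, e))]
    # Transpose back: one 20-byte digest per lane.
    return [struct.pack(">5I", *(s[i] for s in state)) for i in range(LANES)]


if __name__ == "__main__":
    msgs = [b"", b"abc", b"hello", b"x" * 55]
    print(sha1_x4(msgs) == [hashlib.sha1(m).digest() for m in msgs])
```

The round loop is identical for every lane, which is why the four messages cost roughly one message's worth of instructions; the transposition at the start and end is the overhead a real SIMD implementation would also have to pay.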
Article
We describe a method for efficiently hashing multiple messages of different lengths. Such computations occur in various scenarios, one of which is when an operating system checks the integrity of its components during boot time. These tasks can gain performance by parallelizing the computations and using SIMD architectures. For such scenarios, we compare the performance of a new 4-buffer SHA-256 S-HASH implementation to that of standard serial hashing. Our results are measured on the 2nd Generation Intel® Core™ Processor, and demonstrate SHA-256 processing at effectively ~5.2 Cycles per Byte, when hashing from any of the three cache levels, or from the system memory. This represents a speedup by a factor of 3.42x compared to OpenSSL (1.0.1), and by 2.25x compared to the recent and faster n-SMS method. For hashing from a disk, we show an effective rate of ~6.73 Cycles/Byte, which is almost 3 times faster than OpenSSL (1.0.1) under the same conditions. These results indicate that for some usage models, SHA-256 is significantly faster than commonly perceived.
... In [77], four versions of the Secure Hash Algorithm (SHA), namely SHA-1, SHA-256, SHA-384 and SHA-512, are analyzed to determine the performance gains that can be achieved using SIMD operations performed on integers. The author pointed out the parts of each algorithm where SIMD instructions can be used, and showed that each SHA algorithm has great potential to boost both its speed and throughput using SIMD technology. ...
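The abstract above does not describe how messages of different lengths are assigned to the four buffers. As a rough illustration only, the sketch below assumes a simple policy in which a lane that finishes its message is immediately refilled from a queue, and counts how many four-lane compression steps are needed. The function name `simulate_lanes` and the refill policy are assumptions, not the paper's method.

```python
def simulate_lanes(block_counts, lanes=4):
    """Count vector compression steps needed to hash all messages.

    block_counts lists the number of 64-byte blocks in each message.
    Assumed policy: whenever a lane is idle, it takes the next message
    from the queue; each step processes one block in every busy lane.
    """
    queue = list(block_counts)
    remaining = [0] * lanes
    steps = 0
    while True:
        for i in range(lanes):
            if remaining[i] == 0 and queue:
                remaining[i] = queue.pop(0)
        if not any(remaining):
            break
        remaining = [max(r - 1, 0) for r in remaining]
        steps += 1
    return steps


if __name__ == "__main__":
    counts = [4, 1, 1, 1, 1, 2, 2]
    print(simulate_lanes(counts), "vector steps versus", sum(counts), "serial block steps")
```

Under this policy the step count is bounded below by the longest message, so a single very long message among many short ones limits the achievable speedup.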
Article
The volume and complexity of data processed by today's personal computers are increasing exponentially, placing incredible demands on microprocessors. Meanwhile, the computing performance that can be achieved by increasing the clock speed of a microprocessor is reaching physical limits, making architectural solutions more prominent. As a result, an important architectural feature has been added to recent microprocessors: single instruction multiple data (SIMD), a set of instructions that can speed up application performance by allowing a basic operation to be performed on multiple data elements in parallel with fewer instructions. The SIMD computational technique was introduced in the IA-32 Intel® architecture with MMX technology and then further enhanced with Intel's introduction of streaming SIMD extensions (SSE), SSE 2 (SSE2) and SSE 3 (SSE3). Although programming with these SIMD extensions enables software to achieve higher performance, several existing scientific applications are not affected. This paper gives an overview of SIMD multimedia extensions. The features of these extensions are introduced. Available methods for programming with multimedia instruction sets are discussed. It also reviews recent trends in using multimedia extensions to accelerate applications such as multimedia, scientific and engineering applications, and argues for further use in other significant computationally intensive applications.
... Parallelism is one of the most practical approaches to increase the performance of different interpolation methods [3, 4, 5]. Since the arrival of modern microprocessors with single-instruction multiple-data stream (SIMD) processing extensions, little effort has been made to increase the performance of time-consuming algorithms using the SIMD computational model [1, 2, 10]. In the SIMD model, processors perform one instruction on multiple data operands instead of one operand as scalar processors do. ...
Conference Paper
This paper reports the results of SIMD implementation of a number of interpolation algorithms on common personal computers. These methods fit a curve to some given input points for which a mathematical function form is not known. We have implemented four widely used methods using the vector processing capabilities embedded in Pentium processors. By using SSE (streaming SIMD extension) we could perform all operations on four packed single-precision (32-bit) floating point values simultaneously. As a result, the running time decreases by a factor of three or more, depending on the number of points and the interpolation method. We implemented the four interpolation methods using SSE technology and then analyzed their speedup as a function of the number of points being interpolated. A comparison between the characteristics of the developed vector algorithms is also presented.
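The abstract does not name the four interpolation methods, so the sketch below uses Lagrange interpolation purely as an assumed example of how one operation can be applied to four packed values at a time. A Python list of four floats stands in for one SSE register; the function name `lagrange_x4` is hypothetical.

```python
def lagrange_x4(xs, ys, queries):
    """Evaluate the Lagrange interpolating polynomial at four points at once.

    xs, ys  : the known sample points
    queries : exactly four x-values, modeling four packed floats
    """
    assert len(queries) == 4
    result = [0.0] * 4
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = [float(yi)] * 4              # broadcast y_i to all four lanes
        for j, xj in enumerate(xs):
            if j != i:
                scale = 1.0 / (xi - xj)     # scalar shared by every lane
                # One packed subtract and two packed multiplies per lane group.
                term = [t * (q - xj) * scale for t, q in zip(term, queries)]
        result = [r + t for r, t in zip(result, term)]
    return result


if __name__ == "__main__":
    # Samples of y = x**2; the interpolant reproduces the parabola exactly.
    print(lagrange_x4([0, 1, 2], [0, 1, 4], [0.5, 1.5, 3.0, -1.0]))
```

The loop structure does not depend on the query values, so all four lanes follow the same instruction stream, which is the property that makes this kind of computation a good fit for packed SSE arithmetic.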
Conference Paper
With the advent of the Pentium processor, parallelization finally became available to Intel-based computer systems. One of the design principles of the MD4-family of hash functions (MD4, MD5, SHA-1, RIPEMD-160) is to be fast on the 32-bit Intel processors. This paper shows that carefully coded implementations of these hash functions are able to exploit the Pentium's superscalar architecture to its maximum effect: the performance with respect to execution on a non-parallel architecture increases by about 60%. This is an important result in view of the recent claims on the limited data bandwidth of these hash functions. Moreover, it is conjectured that these implementations are very close to optimal. It will also be shown that the performance penalty incurred by non-cached data and endianness conversion is limited, and in the order of 10% of running time.
Conference Paper
Cryptographic methods are widely used within networking and digital rights management. Numerous algorithms exist, e.g. spanning VPNs or distributing sensitive data over a shared network infrastructure. While these algorithms can be run with moderate performance on general purpose processors, such processors do not meet typical embedded systems requirements (e.g. area, cost and power consumption). Instead, specialized cores dedicated to one or a combination of algorithms are typically used. These cores provide very high bandwidth data transmission and meet the needs of embedded systems. However, with such cores changing the algorithm is not possible without replacing the hardware. This paper describes a fully programmable processor architecture which has been tailored for the needs of a spectrum of cryptographic algorithms and has been explicitly designed to run at high clock rates while maintaining a significantly better performance/area/power tradeoff than general purpose processors. Both the architecture and instruction set have been developed to achieve a bits-per-clock rate of greater than one, even with complex algorithms. This performance will be demonstrated with standard cryptographic algorithms (AES and DES) and a widely used hash algorithm (MD5).
Article
In this short note we present an improvement of about 15% over our performance figures for the MD4-family of hash functions as presented at Crypto'96. The improvement is obtained by substituting n-cycle instructions by n 1-cycle instructions, and reducing the number of instructions by means of the super-add instruction lea, thereby carefully avoiding the dreaded AGI.
Article
To enhance system performance computer architectures tend to incorporate an increasing number of parallel execution units. This paper shows that the new generation of MD4-based customized hash functions (RIPEMD-128, RIPEMD-160, SHA-1) contains much more software parallelism than any of these computer architectures is currently able to provide. It is conjectured that the parallelism found in SHA-1 is a design principle. The critical path of SHA-1 is twice as short as that of its closest contender RIPEMD-160, but realizing it would require a 7-way multiple-issue architecture. It will also be shown that, due to the organization of RIPEMD-160 in two independent lines, it will probably be easier for future architectures to exploit its software parallelism.
Article
In 1992, NIST announced a proposed standard for a collision-free hash function. The algorithm for producing the hash value is known as the Secure Hash Algorithm (SHA), and the standard using the algorithm is known as the Secure Hash Standard (SHS). Later, an announcement was made that a scientist at NSA had discovered a weakness in the original algorithm. A revision to this standard was then announced as FIPS 180-1, and includes a slight change to the algorithm that eliminates the weakness. This new algorithm is called SHA-1. In this report we describe a portable and efficient implementation of SHA-1 in the C language. Performance information is given, as well as tips for porting the code to other architectures. We conclude with some observations on the efficiency of the algorithm, and a discussion of how the efficiency of SHA might be improved.
R. Gerber. The Software Optimization Cookbook. Intel Press, 2002.
H.-P. Messmer. The Indispensable Pentium Book. Addison-Wesley, 1995.
B. Gladman. SHA1, SHA2, HMAC and Key Derivation in C, January 2004. http://fp.gladman.plus.com/cryptographytechnology/sha/index.htm.
W. Dai. Crypto++ Library 5.1, 2003. http://www.eskimo.com/~weidai/cryptlib.html.
E. Nahum, D. Yates, S. O'Malley, H. Orman, and R. Schroeppel. Parallelized network security protocols. In Internet Society Symposium on Network and Distributed System Security, pages 145-154, San Diego, California, February 1996. IEEE Computer Society Press.
National Institute for Standards and Technology. Specifications for the secure hash standard. FIPS Publication 180-1, April 2002.
National Institute for Standards and Technology. Specifications for the secure hash standard. FIPS Publication 180-2, August 2002.
K. R. Wadleigh and I. L. Crawford. Software Optimization for High Performance Computing. Prentice-Hall, 2000.