Article

Fast hashing on pentium SIMD architecture /

Printout. Thesis (M.S.)--Oregon State University, 2005. Includes bibliographical references (leaves 37-38).

Parallel SHA-256 on SW26010 many-core processor for hashing of multiple messages

Article

Aug 2022
J SUPERCOMPUT

To explore whether new parallelism techniques can provide additional performance improvements in cryptographic hash functions, we conducted our study with the SW26010, which is a special-architecture processor on Sunway TaihuLight, one of the world’s fastest supercomputers. Secure Hash Algorithms (SHAs) are significant for secure transmission, with SHA-256 remaining a secure design and among the most efficient SHA designs. We propose SW-SHA-256, a parallel SHA-256 implementation for hashing of multiple messages on the SW26010. Our work explores parallel schemes at the instruction and thread levels. At the instruction level, we use vector registers to load multiple messages and hash them simultaneously. Assembly-level optimization methods such as dual issue are used, and the pipeline is distinct from that of a general-purpose processor. At the thread level, an optimized DMA transmission strategy and a double-buffer technique are used to reduce the cost of transfers from memory to cache. As a result, we obtain 5.87 cycles per byte on a single core, an 8.18x speedup over the C code in OpenSSL v3.0.0. Moreover, our implementation achieves a throughput of 60.21 GB/s on a SW26010 processor and is highly scalable.

SoK: A Performance Evaluation of Cryptographic Instruction Sets on Modern Architectures

Conference Paper

May 2018

The latest processors have included extensions to the instruction set architecture tailored to speed up the execution of cryptographic algorithms. Like the AES New Instructions (AES-NI) that target the AES encryption algorithm, the release of the SHA New Instructions (SHA-NI), designed to support the SHA-256 hash function, introduces a new scenario for optimizing cryptographic software. In this work, we present a performance evaluation of several cryptographic algorithms, hash-based signatures and data encryption, on platforms that support AES-NI and/or SHA-NI. In particular, we revisited several optimization techniques targeting multiple-message hashing, and as a result, we reduce by 21% the running time of this task by means of a pipelined SHA-NI implementation. In public-key cryptography, multiple-message hashing is one of the critical operations of the XMSS and XMSS^MT post-quantum hash-based digital signatures. Using SHA-NI extensions, signatures are computed 4x faster; however, our pipelined SHA-NI implementation increased this speedup factor to 4.3x. For symmetric cryptography, we revisited the implementation of AES modes of operation and reduced by 12% and 7% the running time of CBC decryption and CTR encryption, respectively.

Simultaneous Hashing of Multiple Messages

Article

Jan 2012

We describe a method for efficiently hashing multiple messages of different lengths. Such computations occur in various scenarios, and one of them is when an operating system checks the integrity of its components during boot time. These tasks can gain performance by parallelizing the computations and using SIMD architectures. For such scenarios, we compare the performance of a new 4-buffer SHA-256 S-HASH implementation to that of the standard serial hashing. Our results are measured on the 2nd Generation Intel® Core™ Processor, and demonstrate SHA-256 processing at effectively ~5.2 Cycles per Byte, when hashing from any of the three cache levels, or from the system memory. This represents a speedup by a factor of 3.42x compared to OpenSSL (1.0.1), and by 2.25x compared to the recent and faster n-SMS method. For hashing from a disk, we show an effective rate of ~6.73 Cycles/Byte, which is almost 3 times faster than OpenSSL (1.0.1) under the same conditions. These results indicate that for some usage models, SHA-256 is significantly faster than commonly perceived.
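The multi-buffer idea in this abstract can be illustrated with a minimal Python sketch. It is not the S-HASH implementation: the names `LANES`, `vmap`, `compress4`, and `sha256_x4` are hypothetical, Python lists stand in for SIMD registers, and messages of different lengths are handled by letting finished lanes idle, which is a simplification chosen here rather than the scheduling the paper describes. Each step applies one operation to the same word of four independent messages.

```python
import hashlib
import math
import struct
from operator import add

LANES = 4            # assumption: four 32-bit lanes, like one 128-bit SIMD register
MASK = 0xFFFFFFFF

def _icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << 40
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# SHA-256 constants: fractional bits of square/cube roots of the first 64 primes.
PRIMES = [p for p in range(2, 312) if all(p % q for q in range(2, p))]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [_icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK

def vmap(op, *vecs):
    """Apply one 32-bit operation to every lane, as a SIMD instruction would."""
    return [op(*lane) & MASK for lane in zip(*vecs)]

def compress4(state, blocks):
    """One SHA-256 compression over LANES independent 64-byte blocks."""
    words = [struct.unpack(">16I", b) for b in blocks]
    w = [list(col) for col in zip(*words)]   # w[t] holds word t of every lane
    for t in range(16, 64):
        w.append(vmap(
            lambda p, q, r, s: p + (rotr(q, 7) ^ rotr(q, 18) ^ (q >> 3))
            + r + (rotr(s, 17) ^ rotr(s, 19) ^ (s >> 10)),
            w[t - 16], w[t - 15], w[t - 7], w[t - 2]))
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        t1 = vmap(lambda h, e, f, g, wt: h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
                  + ((e & f) ^ (~e & g)) + K[t] + wt, h, e, f, g, w[t])
        t2 = vmap(lambda a, b, c: (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
                  + ((a & b) ^ (a & c) ^ (b & c)), a, b, c)
        a, b, c, d, e, f, g, h = (vmap(add, t1, t2), a, b, c,
                                  vmap(add, d, t1), e, f, g)
    return [vmap(add, s, v) for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def pad(msg):
    """Standard SHA-256 padding: 0x80, zeros, 64-bit big-endian bit length."""
    zeros = (55 - len(msg)) % 64
    return msg + b"\x80" + b"\x00" * zeros + struct.pack(">Q", 8 * len(msg))

def sha256_x4(messages):
    """Hash LANES messages together; returns their hex digests in order."""
    assert len(messages) == LANES
    padded = [pad(m) for m in messages]
    nblocks = [len(p) // 64 for p in padded]
    state = [[h] * LANES for h in H0]
    for i in range(max(nblocks)):
        blocks = [p[64 * i:64 * i + 64] if i < n else bytes(64)
                  for p, n in zip(padded, nblocks)]
        new = compress4(state, blocks)
        # Lanes whose message is finished keep their old state (a blend in real SIMD).
        state = [[nw if i < n else old for nw, old, n in zip(nv, ov, nblocks)]
                 for nv, ov in zip(new, state)]
    return [b"".join(struct.pack(">I", state[j][lane]) for j in range(8)).hex()
            for lane in range(LANES)]
```

Calling `sha256_x4([b"", b"abc", b"a" * 200, b"x" * 64])` should return the same four digests as `hashlib.sha256(m).hexdigest()` computed for each message separately. In this sketch the shorter messages waste lane work while the longest one finishes, which is the kind of inefficiency a scheduling scheme for different-length messages aims to avoid.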

A Review of SIMD Multimedia Extensions and their Usage in Scientific and Engineering Applications

Article

Nov 2008

The volume and complexity of data processed by today's personal computers are increasing exponentially, placing incredible demands on microprocessors. Meanwhile, the computing performance that can be achieved by increasing the clock speed of a microprocessor is reaching physical limits, making architectural solutions more prominent. For this reason an important architectural feature has been added to recent microprocessors: single instruction multiple data (SIMD), a set of instructions that can speed up application performance by allowing a basic operation to be performed on multiple data elements in parallel with fewer instructions. The SIMD computational technique was introduced in the IA-32 Intel® architecture with MMX technology and then further enhanced with Intel's introduction of streaming SIMD extensions (SSE), SSE 2 (SSE2) and SSE 3 (SSE3). Although programming with these SIMD extensions enables software to achieve higher performance, several existing scientific applications are not affected. This paper gives an overview of SIMD multimedia extensions. The features of these extensions are introduced. Available methods for programming with multimedia instruction sets are discussed. It also reviews recent trends to use multimedia extensions to accelerate many applications such as multimedia, scientific and engineering applications, and argues for further use in other significant computationally intensive applications.

Efficient SIMD Numerical Interpolation

Conference Paper

Sep 2005
Lect Notes Comput Sci

This paper reports the results of SIMD implementation of a number of interpolation algorithms on common personal computers. These methods fit a curve on some given input points for which a mathematical function form is not known. We have implemented four widely used methods using vector processing capabilities embedded in Pentium processors. By using SSE (streaming SIMD extension) we could perform all operations on four packed single-precision (32-bit) floating point values simultaneously. Therefore, the running time decreases three times or even more depending on the number of points and the interpolation method. We have implemented four interpolation methods using SSE technology then analyzed their speedup as a function of the number of points being interpolated. A comparison between characteristics of developed vector algorithms is also presented.
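As a rough illustration of operating on four packed values at once, the sketch below evaluates a linear interpolant at four query points per step. The abstract does not name its four methods, so linear interpolation is an assumption made here, and `WIDTH`, `packed`, `lerp4`, and `lerp_many` are hypothetical names. Python lists of double-precision floats stand in for SSE registers of single-precision values.

```python
WIDTH = 4  # four packed values per step, mirroring one SSE register

def packed(op, *vecs):
    """Apply one arithmetic operation to all four lanes at once."""
    return [op(*lane) for lane in zip(*vecs)]

def lerp4(x0, y0, x1, y1, xs):
    """Evaluate the line through (x0, y0) and (x1, y1) at four points."""
    slope = (y1 - y0) / (x1 - x0)
    diff = packed(lambda x, a: x - a, xs, [x0] * WIDTH)      # packed subtract
    prod = packed(lambda d, s: d * s, diff, [slope] * WIDTH)  # packed multiply
    return packed(lambda p, b: p + b, prod, [y0] * WIDTH)     # packed add

def lerp_many(x0, y0, x1, y1, xs):
    """Process any number of query points in chunks of WIDTH."""
    out = []
    for i in range(0, len(xs), WIDTH):
        chunk = list(xs[i:i + WIDTH])
        missing = WIDTH - len(chunk)
        chunk += [x0] * missing  # pad the tail so every step sees four lanes
        out.extend(lerp4(x0, y0, x1, y1, chunk)[:WIDTH - missing])
    return out
```

With this layout, three packed operations serve four query points, which is consistent in spirit with the roughly threefold or better reduction in running time the abstract reports, although the measured figure depends on the method and the number of points.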

Inner loops: A sourcebook for fast 32-bit software development

Article

Jan 1997

Rick Booth

The Complete Guide to MMX Technology

Article

Jan 1997

Fast Hashing on the Pentium

Conference Paper

Aug 1996

With the advent of the Pentium processor parallelization finally became available to Intel based computer systems. One of the design principles of the MD4-family of hash functions (MD4, MD5, SHA-1, RIPEMD-160) is to be fast on the 32-bit Intel processors. This paper shows that carefully coded implementations of these hash functions are able to exploit the Pentium's superscalar architecture to its maximum effect: the performance with respect to execution on a non-parallel architecture increases by about 60%. This is an important result in view of the recent claims on the limited data bandwidth of these hash functions. Moreover, it is conjectured that these implementations are very close to optimal. It will also be shown that the performance penalty incurred by non-cached data and endianness conversion is limited, and in the order of 10% of running time.

Cryptonite – A Programmable Crypto Processor Architecture for High-Bandwidth Applications

Conference Paper

Mar 2004

Cryptographic methods are widely used within networking and digital rights management. Numerous algorithms exist, e.g. spanning VPNs or distributing sensitive data over a shared network infrastructure. While these algorithms can be run with moderate performance on general purpose processors, such processors do not meet typical embedded systems requirements (e.g. area, cost and power consumption). Instead, specialized cores dedicated to one or a combination of algorithms are typically used. These cores provide very high bandwidth data transmission and meet the needs of embedded systems. However, with such cores changing the algorithm is not possible without replacing the hardware. This paper describes a fully programmable processor architecture which has been tailored for the needs of a spectrum of cryptographic algorithms and has been explicitly designed to run at high clock rates while maintaining a significantly better performance/area/power tradeoff than general purpose processors. Both the architecture and instruction set have been developed to achieve a bits-per-clock rate of greater than one, even with complex algorithms. This performance will be demonstrated with standard cryptographic algorithms (AES and DES) and a widely used hash algorithm (MD5).

Pentium Processor System Architecture

Book

Jan 1995

Towards High Performance Cryptographic Software

Conference Paper

Sep 1995

Even Faster Hashing on the Pentium

Article

Feb 2002

Antoon Bosselaers

In this short note we present an improvement of about 15% over our performance figures for the MD4-family of hash functions as presented at Crypto'96. The improvement is obtained by substituting n-cycle instructions by n 1-cycle instructions, and reducing the number of instructions by means of the super-add instruction lea, thereby carefully avoiding the dreaded AGI.
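The instruction-count saving from `lea` can be modeled in a few lines of Python. The note does not say which instructions were merged, so the round step below (adding a register and a constant to an accumulator) is a hypothetical example, and `lea`, `step_two_adds`, and `step_one_lea` are illustrative names. The model relies only on the x86 addressing form `base + index*scale + disp`, which `lea` evaluates in a single instruction.

```python
MASK32 = 0xFFFFFFFF

def lea(base, index=0, scale=1, disp=0):
    """Model x86 `lea dst, [base + index*scale + disp]` on 32-bit registers."""
    assert scale in (1, 2, 4, 8)  # the only scale factors x86 addressing allows
    return (base + index * scale + disp) & MASK32

def step_two_adds(a, f, t):
    """Two separate instructions: add a, f ; add a, T."""
    a = (a + f) & MASK32
    a = (a + t) & MASK32
    return a

def step_one_lea(a, f, t):
    """Same result from one instruction: lea a, [a + f + T]."""
    return lea(a, f, 1, t)
```

Both functions return the same value for any 32-bit inputs, for example `step_two_adds(0xFFFFFFF0, 0x20, 0xD76AA478) == step_one_lea(0xFFFFFFF0, 0x20, 0xD76AA478)`. The trade-off hinted at in the note is that `lea` goes through address generation, so on the Pentium its source registers must not have been written immediately beforehand, or an address generation interlock (AGI) stall results.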

SHA: A Design for Parallel Architectures?

Article

Feb 2002

To enhance system performance computer architectures tend to incorporate an increasing number of parallel execution units. This paper shows that the new generation of MD4-based customized hash functions (RIPEMD-128, RIPEMD-160, SHA-1) contains much more software parallelism than any of these computer architectures is currently able to provide. It is conjectured that the parallelism found in SHA-1 is a design principle. The critical path of SHA-1 is twice as short as that of its closest contender RIPEMD-160, but realizing it would require a 7-way multiple-issue architecture. It will also be shown that, due to the organization of RIPEMD-160 in two independent lines, it will probably be easier for future architectures to exploit its software parallelism.

A Fast Portable Implementation of the Secure Hash Algorithm, III

Article

Jun 2001

Kevin S. McCurley

In 1992, NIST announced a proposed standard for a collision-free hash function. The algorithm for producing the hash value is known as the Secure Hash Algorithm (SHA), and the standard using the algorithm is known as the Secure Hash Standard (SHS). Later, an announcement was made that a scientist at NSA had discovered a weakness in the original algorithm. A revision to this standard was then announced as FIPS 180-1, and includes a slight change to the algorithm that eliminates the weakness. This new algorithm is called SHA-1. In this report we describe a portable and efficient implementation of SHA-1 in the C language. Performance information is given, as well as tips for porting the code to other architectures. We conclude with some observations on the efficiency of the algorithm, and a discussion of how the efficiency of SHA might be improved.

The Software Optimization Cookbook

Jan 2002

R Gerber

R. Gerber. The Software Optimization Cookbook. Intel Press, 2002.

The Indispensable Pentium Book

Jan 1995

H.-P Messmer

H.-P. Messmer. The Indispensable Pentium Book. Addison-Wesley, 1995.

SHA1, SHA2, HMAC and Key Derivation in C

Jan 2004

B Gladman

B. Gladman. SHA1, SHA2, HMAC and Key Derivation in C, January 2004. http://fp.gladman.plus.com/cryptographytechnology/sha/index.htm.

Crypto++ Library 5.1

Jan 2003

W Dai

W. Dai. Crypto++ Library 5.1, 2003. http://www.eskimo.com/ weidai/cryptlib.html.

Parallelized network security protocols

Feb 1996
145-154

E Nahum
D Yates
S O'Malley
H Orman
R Schroeppel

E. Nahum, D. Yates, S. O'Malley, H. Orman, and R. Schroeppel. Parallelized network security protocols. In Internet Society Symposium on Network and Distributed System Security, pages 145-154, San Diego, California, February 1996. IEEE Computer Society Press.

National Institute for Standards and Technology. Specifications for the secure hash standard. FIPS Publication 180-1, April 2002.

National Institute for Standards and Technology. Specifications for the secure hash standard. FIPS Publication 180-2, August 2002.

Software Optimization for High Performance Computing

Jan 2000

K R Wadleigh
I L Crawford

K. R. Wadleigh and I. L. Crawford. Software Optimization for High Performance Computing. Prentice-Hall, 2000.
