Speedup of MMX and MMMX over the C implementation for different kernels with block size 16 × 16 on a 1-way issue out-of-order processor.

Source publication

Limitations of special-purpose instructions for similarity measurements in media SIMD extensions

Conference Paper

Oct 2006

Microprocessor vendors have provided special-purpose instructions such as psadbw and pdist to accelerate the sum-of-absolute differences (SAD) similarity measurement. The usefulness of these special-purpose instructions is limited except for the motion estimation kernel. This has several drawbacks. First, if the SAD becomes obsolete because...
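
As a point of reference for the abstract above, the SAD measure itself can be sketched in a few lines of scalar Python. This is only an illustration of the definition and of its use in block matching, not the paper's code and not a model of how psadbw or pdist operate internally. The function names `sad` and `best_match` and the flat-list block layout are assumptions made for this sketch.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    Blocks are flat sequences of pixel values (an assumption of this sketch).
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(current, candidates):
    """Return (index, score) of the candidate block with the lowest SAD.

    This mirrors the role SAD plays in motion estimation: the candidate
    most similar to the current block is the one with the smallest score.
    """
    scores = [sad(current, candidate) for candidate in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

A SIMD instruction for SAD computes the same quantity over several 8-bit pixels at once; the scalar form above only fixes what is being computed.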

Optimization of Content-Based Image Retrieval Functions

Conference Paper

Dec 2008

SIMD programming using Intel vector extensions

Article

Sep 2019
J PARALLEL DISTR COM

Performance Evaluation of Implicit and Explicit SIMDization

Article

Sep 2018
MICROPROCESS MICROSY

Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA), which is equipped with Special Purpose Instructions (SPIs). In order to exploit these features, many programming approaches have been developed. The Intrinsic Programming Model (IPM) is a low-level concept for explicit SIMDization. In addition, Compiler's Automatic Vectorization (CAV) has been embedded in modern compilers such as the Intel C++ compiler (ICC), the GNU Compiler Collection (GCC), and LLVM for implicit vectorization. Each SIMDization shows different improvements because of different SIMD ISAs, vector register widths, and programming approaches. Our goal in this paper is to evaluate the performance of explicit and implicit vectorization. Our experimental results show that the behavior of explicit vectorization is almost the same across compilers, in contrast to implicit vectorization. IPM improves performance more than CAVs. In general, the ICC and GCC compilers vectorize kernels and use SPIs more efficiently than LLVM. In addition, AVX2 technology is more useful for small matrices and compute-intensive kernels than for large matrices and data-intensive kernels because of memory bottlenecks. Furthermore, CAVs fail to vectorize kernels that have overlapping and non-consecutive memory access patterns. The way a kernel is implemented affects vectorization. In order to understand what kind of scalar implementation of an algorithm is suitable for vectorization, an approach based on a code modification technique is proposed. Our experimental results show that scalar implementations that use loop collapsing, loop unrolling, software pipelining, or loop exchange can be vectorized more efficiently than straightforward implementations.

A fast normalized cross-correlation calculation method for motion estimation

Article

Jul 2010
IEEE T ULTRASON FERR

High-precision motion estimation has become essential in ultrasound-based techniques such as time-domain Doppler and elastography. Normalized cross-correlation (NCC) has been shown to be one of the best motion estimators. However, a significant drawback is its associated computational cost, especially when RF signals are used. In this paper, a method based on sum tables developed elsewhere is adapted for fast NCC calculation in ultrasound-based motion estimation, and its speed enhancement is tested for this specific application. Both the numerator and denominator in the NCC definition are obtained through pre-calculated sum tables to eliminate the redundancy of repeated NCC calculations. Unlike a previously reported method, a search region following the principle of motion estimation is applied in the construction of sum tables. Because an exhaustive search and high window overlap are typically used for highest quality imaging, the computational cost of the proposed method is significantly lower than that of the direct method using the NCC definition, without increasing bias and variance characteristics of the motion estimation or sacrificing the spatial resolution. Therefore, high quality, high spatial resolution, and high calculation speed can all be obtained simultaneously using the proposed methodology. The high efficiency of this method was verified using RF signals from a human abdominal aorta in vivo. For the parameters typically used, a real-time, very high frame rate of 310 frames/s was achieved for the motion estimation. The proposed method was also extended to 2-D NCC motion estimation and motion estimation with other algorithms. The technique could thus prove very useful and flexible for real-time motion estimation as well as in other fields such as optical flow and image registration.
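
The sum-table idea in the abstract above can be illustrated with a minimal 1-D sketch. It assumes the non-mean-subtracted form of NCC, a single fixed non-negative lag, and plain Python lists; the paper's actual formulation, search-region handling, and 2-D extension are not reproduced here. The names `prefix_sums`, `ncc_direct`, and `ncc_sum_tables` are hypothetical.

```python
import math


def prefix_sums(values):
    """Build a sum table: table[i] is the sum of the first i values."""
    table = [0.0]
    for v in values:
        table.append(table[-1] + v)
    return table


def ncc_direct(f, g, start, width, lag):
    """NCC of one window computed straight from the definition."""
    idx = range(start, start + width)
    num = sum(f[i] * g[i + lag] for i in idx)
    energy_f = sum(f[i] ** 2 for i in idx)
    energy_g = sum(g[i + lag] ** 2 for i in idx)
    return num / math.sqrt(energy_f * energy_g)


def ncc_sum_tables(f, g, width, lag):
    """NCC for every window start at one lag, using precomputed sum tables.

    Overlapping windows share most of their terms, so each window sum is
    obtained as the difference of two table entries instead of a new loop.
    """
    n = min(len(f), len(g) - lag)  # number of valid f[i] * g[i + lag] pairs
    cross = prefix_sums(f[i] * g[i + lag] for i in range(n))
    energy_f = prefix_sums(f[i] ** 2 for i in range(n))
    energy_g = prefix_sums(g[i + lag] ** 2 for i in range(n))
    result = []
    for start in range(n - width + 1):
        end = start + width
        num = cross[end] - cross[start]
        den = math.sqrt((energy_f[end] - energy_f[start]) *
                        (energy_g[end] - energy_g[start]))
        result.append(num / den)
    return result
```

With heavy window overlap, the direct form repeats almost all of its multiplications from one window to the next, while the table form pays for each product once per lag. The sketch does not guard against windows with zero energy.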

Optimization of Content-Based Image Retrieval Functions

Conference Paper

Dec 2008

Feature extraction and similarity measurement are two important operations in content-based image retrieval systems. We optimize and vectorize typical feature extraction algorithms, mean and standard deviation, and some similarity measurement functions such as the Sum-of-Squared-Differences (SSD), the Sum-of-Absolute Differences (SAD), and histogram intersection on a general-purpose processor enhanced with SIMD extensions. In the straightforward implementation of the mean and standard deviation, there are two passes, one to compute the mean and one to compute the standard deviation. We use a single-loop approach that computes both the mean and the standard deviation in a single pass. This technique yields a speedup of up to 1.85 over the double-loop implementation. We vectorize the single-loop implementation using the MMX and SSE2 extensions. The vectorized versions improve performance by a factor of up to 14.49. In addition, we vectorize the SSD, SAD, and histogram intersection similarity measurements using SSE. The vectorized versions provide a maximum speedup of 1.45, 2.33, and 5.24 for the SSD, the SAD, and histogram intersection, respectively, over the optimized scalar implementations.
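
The difference between the double-loop and single-loop approaches described above can be sketched as follows. The abstract does not give the exact formulas, so this sketch assumes the population standard deviation and the common identity variance = E[x²] − (E[x])²; the function names are hypothetical and the code is scalar Python, not the paper's MMX/SSE2 implementation.

```python
import math


def mean_std_two_pass(values):
    """Double-loop form: one pass for the mean, a second for the deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


def mean_std_single_pass(values):
    """Single-loop form: accumulate the sum and the sum of squares together."""
    n = 0
    total = 0
    total_sq = 0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    mean = total / n
    # Assumed identity: variance = E[x^2] - (E[x])^2. Clamp at zero because
    # rounding can push the difference slightly negative.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

The single-pass form reads each value once, which is what makes it attractive for vectorization, but the subtraction of two large accumulated terms can lose precision for floating-point data with a large mean. For small integer pixel values this is less of a concern.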

A Review of SIMD Multimedia Extensions and their Usage in Scientific and Engineering Applications

Article

Nov 2008
COMPUT J

The volume and complexity of data processed by today's personal computers are increasing exponentially, placing incredible demands on microprocessors. Meanwhile, the computing performance that can be achieved by increasing a microprocessor's clock speed is reaching physical limits, making architectural solutions more prominent. As a result, an important architectural feature has been added to recent microprocessors: single instruction multiple data (SIMD), a set of instructions that can speed up application performance by allowing basic operations to be performed on multiple data elements in parallel with fewer instructions. The SIMD computational technique was introduced in the IA-32 Intel® architecture with MMX technology and then further enhanced with Intel's introduction of streaming SIMD extensions (SSE), SSE 2 (SSE2) and SSE 3 (SSE3). Although programming using these SIMD extensions enables software to achieve higher performance, several existing scientific applications are not affected. This paper gives an overview of SIMD multimedia extensions. The features of these extensions are introduced. Available methods for programming with multimedia instruction sets are discussed. It also reviews recent trends in using multimedia extensions to accelerate applications such as multimedia, scientific, and engineering applications, and argues for further use in other significant computationally intensive applications.

Versatility of extended subwords and the matrix register file

Article

May 2008
ACM T ARCHIT CODE OP

Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers correspond to a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, it is shown how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, it is shown that special-purpose instructions (SPIs), such as the sum-of-absolute differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is sufficient neither for quarter-pixel resolution nor for cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using, at most, 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
