Figure 17 - uploaded by Ben H. H. Juurlink
Speedup of MMX and MMMX over the C implementation for different kernels with block size 16 × 16 on a 1-way-issue out-of-order processor.

Source publication
Conference Paper
Full-text available
Microprocessor vendors have provided special-purpose instructions such as psadbw and pdist to accelerate the sum-of-absolute differences (SAD) similarity measurement. The usefulness of these special-purpose instructions is limited except for the motion estimation kernel. This has several drawbacks. First, if the SAD becomes obsolete because...

Similar publications

Conference Paper
Full-text available
Feature extraction and similarity measurement are two important operations in content-based image retrieval systems. We optimize and vectorize typical feature extraction algorithms, mean and standard deviation, and some similarity measurement functions such as the Sum-of-Squared-Differences (SSD), the Sum-of-Absolute Differences (SAD), and histog...

Citations

... These instructions are needed to make SIMD instructions practical [91]. For example, SPIs gain significant improvements compared to the equivalent SIMD operations [14,87]. In some applications, data are not laid out in the required order, so SIMD reorganizing instructions are needed to implement the SIMD version of the application; however, this adds overhead to the implementation. ...
... SPIs are implemented to perform a set of applicable operations using a single SIMD instruction. Since the primary goal of SIMD extensions is improving the performance of multimedia kernels, microprocessor vendors have provided SPIs, particularly for multimedia kernels [87]. Intel has developed SPIs in each generation of SSE extensions. ...
... Each SIMD technology has its own Instruction Set Architecture (ISA) to utilize Data Level Parallelism (DLP). Particularly, Special Purpose Instructions (SPIs) have been embedded in ISAs for certain kernels [7]. ...
... The Sum of Squared Differences (SSD) and Sum of Absolute Differences (SAD) as the primary similarity measurements [7] are applied on 16-bit and 8-bit integers, respectively. Since x86 architecture has an SPI of SAD, the SAD kernel is chosen to evaluate the compiler's behavior for special purpose instructions. ...
Article
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA), which is equipped with Special Purpose Instructions (SPIs). In order to exploit these features, many programming approaches have been developed. Intrinsic Programming Model (IPM) is a low-level concept for explicit SIMDization. Besides, Compiler’s Automatic Vectorization (CAV) has been embedded in modern compilers such as Intel C++ compiler (ICC), GNU Compiler Collection (GCC) and LLVM for implicit vectorization. Each SIMDization shows different improvements because of different SIMD ISAs, vector register width, and programming approaches. Our goal in this paper is to evaluate the performance of explicit and implicit vectorization. Our experimental results show that the behavior of explicit vectorization on different compilers is almost the same compared to implicit vectorization. IPM improves the performance more than CAVs. In general, ICC and GCC compilers can more efficiently vectorize kernels and use SPI compared to LLVM. In addition, AVX2 technology is more useful for small matrices and compute-intensive kernels compared to large matrices and data-intensive kernels because of memory bottlenecks. Furthermore, CAVs fail to vectorize kernels which have overlapping and non-consecutive memory access patterns. The way a kernel is implemented affects vectorization. In order to understand which kinds of scalar implementation of an algorithm are suitable for vectorization, an approach based on a code modification technique is proposed.
Our experimental results show that scalar implementations that use loop collapsing, loop unrolling, software pipelining, or loop exchange techniques can be vectorized more efficiently than straightforward implementations.
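The citation contexts above name the SAD and the SSD as the primary similarity measurements. As a point of reference, a minimal scalar sketch of both is given below. The function names `sad` and `ssd` and the 2 × 2 sample blocks are illustrative assumptions, not code from any of the cited papers, and the sketch does not model psadbw or any other SIMD instruction.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized 2-D blocks."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical 2 x 2 pixel blocks, chosen only for illustration.
reference = [[10, 20], [30, 40]]
candidate = [[12, 18], [30, 45]]

print(sad(reference, candidate))
print(ssd(reference, candidate))
```

In a motion estimation setting, a function like this would be evaluated for every candidate block in the search area and the candidate with the smallest value kept.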
... the efficiency of the methods thus depends on the workstation used. The SAD can be implemented in parallel by using Intel CPU instruction sets, as previously shown [52], [53], but was typically limited to a window size of 4, 8, or 16 samples and 8-bit data precision [54]. In contrast, the sum-table method can still work with an arbitrary window size, data precision, and motion estimator, and hence has a more general impact in its applications. It is also worth mentioning that the computational costs of different motion estimators are on the same order of magnitude with the use of the sum-table method because of its high efficiency (Fig. 6). ...
Article
Full-text available
High-precision motion estimation has become essential in ultrasound-based techniques such as time-domain Doppler and elastography. Normalized cross-correlation (NCC) has been shown as one of the best motion estimators. However, a significant drawback is its associated computational cost, especially when RF signals are used. In this paper, a method based on sum tables developed elsewhere is adapted for fast NCC calculation in ultrasound-based motion estimation, and is tested with respect to the speed enhancement of the specific application of ultrasound-based motion estimation. Both the numerator and denominator in the NCC definition are obtained through pre-calculated sum tables to eliminate redundancy of repeated NCC calculations. Unlike a previously reported method, a search region following the principle of motion estimation is applied in the construction of sum tables. Because an exhaustive search and high window overlap are typically used for highest quality imaging, the computational cost of the proposed method is significantly lower than that of the direct method using the NCC definition, without increasing bias and variance characteristics of the motion estimation or sacrificing the spatial resolution. Therefore, high quality, high spatial resolution, and high calculation speed can be all simultaneously obtained using the proposed methodology. The high efficiency of this method was verified using RF signals from a human abdominal aorta in vivo. For the parameters typically used, a real-time, very high frame rate of 310 frames/s was achieved for the motion estimation. The proposed method was also extended to 2-D NCC motion estimation and motion estimation with other algorithms. The technique could thus prove very useful and flexible for real-time motion estimation as well as in other fields such as optical flow and image registration.
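The sum-table idea in the abstract above can be illustrated with a one-dimensional sketch: once a table of running sums is built, the sum over any window is a single subtraction, so overlapping windows do not repeat work. The code below shows this only for a windowed energy term (sum of squares) of the kind that appears in an NCC denominator. The names `sum_table` and `window_sum` and the sample signal are assumptions made for illustration, and the paper's actual construction (including its search-region handling) is not reproduced here.

```python
from itertools import accumulate


def sum_table(values):
    """Running sums with a leading zero: table[i] == sum of the first i values."""
    return [0] + list(accumulate(values))


def window_sum(table, start, length):
    """Sum of `length` consecutive values starting at index `start`."""
    return table[start + length] - table[start]


# Hypothetical signal samples, chosen only for illustration.
signal = [1, 2, 3, 4, 5]

# Table over squared samples, so window_sum yields the window energy.
energy_table = sum_table(v * v for v in signal)

print(window_sum(energy_table, 1, 3))  # energy of samples 2, 3, 4
```

Building the table costs one pass over the data; each window query afterwards costs constant time regardless of window size.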
... The main reason for this is that time is critical in on-line CBIR systems, as the response time needs to be low for good interactivity [3]. Since both feature extraction and similarity measurement algorithms exhibit significant amounts of data-level parallelism [11,13], they could be implemented using the Single-Instruction Multiple-Data (SIMD) instructions supported by most General-Purpose Processors (GPPs). In this paper we make the following contributions compared to other works. ...
... An example is the SSE instruction psadbw [10], which accelerates motion estimation based on the SAD function. This instruction has limited usefulness except for the motion estimation kernel [11]. Shahbahrami et al. [11] have discussed the limitations of special-purpose instructions for similarity measurements supported by the SIMD extensions. ...
... This instruction has limited usefulness except for the motion estimation kernel [11]. Shahbahrami et al. [11] have discussed the limitations of special-purpose instructions for similarity measurements supported by the SIMD extensions. They have designed and implemented a few new SIMD instructions using wider registers to implement different similarity measurement functions. ...
Conference Paper
Full-text available
Feature extraction and similarity measurement are two important operations in content-based image retrieval systems. We optimize and vectorize typical feature extraction algorithms, mean and standard deviation, and some similarity measurement functions such as the Sum-of-Squared-Differences (SSD), the Sum-of-Absolute Differences (SAD), and histogram intersection on a general-purpose processor enhanced with SIMD extensions. In the straightforward implementation of the mean and standard deviation, there are two passes, one to compute the mean and one to compute the standard deviation. We use a single-loop approach that computes both the mean and the standard deviation in a single pass. This technique yields a speedup of up to 1.85 over the double-loop implementation. We vectorize the single-loop implementation using the MMX and SSE2 extensions. The vectorized versions improve performance by a factor of up to 14.49. In addition, we vectorize the SSD, SAD, and histogram intersection similarity measurements using SSE. The vectorized versions provide a maximum speedup of 1.45, 2.33, and 5.24 for the SSD, the SAD, and histogram intersection, respectively, over the optimized scalar implementations.
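The abstract above does not spell out how the single-loop version is built. One common way to obtain both statistics in one pass, assumed here, is to accumulate the sum and the sum of squares together and then apply the identity variance = E[x²] − (E[x])². The function name `mean_std_single_pass` and the sample data are hypothetical, and the result is the population standard deviation.

```python
import math


def mean_std_single_pass(pixels):
    """Return (mean, population standard deviation) using one loop over the data."""
    count = 0
    total = 0
    total_sq = 0
    for p in pixels:
        count += 1
        total += p
        total_sq += p * p
    if count == 0:
        raise ValueError("pixels must not be empty")
    mean = total / count
    # variance = E[x^2] - (E[x])^2; clamp tiny negative rounding errors to zero
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Hypothetical pixel values, chosen only for illustration.
mean, std = mean_std_single_pass([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, std)
```

The sum-of-squares form reads each value once, at the cost of some numerical robustness for floating-point data with a large mean relative to its spread; for small integer pixel values the accumulators are exact.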
... The usefulness of these special-purpose instructions is limited except for the motion estimation kernel. The limitations of these special-purpose instructions such as psadbw and pdist in media SIMD extensions are discussed by Shahbahrami et al. [73]. In this paper, the authors design and evaluate a variety of SIMD instructions for different data types. ...
Article
The volume and complexity of data processed by today's personal computers are increasing exponentially, placing incredible demands on the microprocessors. In the meantime, the computing performance that can be achieved by increasing the clock speed of a microprocessor is reaching physical limits, making architectural solutions more prominent. As a result, an important architectural feature has been added to recent microprocessors: single instruction multiple data (SIMD), a set of instructions that can speed up application performance by allowing basic operations to be performed on multiple data elements in parallel with fewer instructions. The SIMD computational technique was introduced in the IA-32 Intel® architecture with MMX technology and then further enhanced with Intel's introduction of streaming SIMD extensions (SSE), SSE 2 (SSE2) and SSE 3 (SSE3). Although programming using these SIMD extensions enables software to achieve higher performance, several existing scientific applications are not affected. This paper gives an overview of SIMD multimedia extensions. The features of these extensions are introduced. Available methods for programming with multimedia instruction sets are discussed. It also reviews recent trends to use multimedia extensions to accelerate many applications such as multimedia, scientific and engineering applications, and argues for further use in other significant computationally intensive applications.
... In addition, we discuss the new SIMD instructions and provide a preliminary evaluation of the hardware cost of the proposed techniques. More details about the MMMX architecture can be found in previous work [Shahbahrami et al. 2006a, 2006b, 2006c]. ...
... In this section, we discuss in detail the SIMD implementations of the color space conversion and similarity measurement functions. The SIMD implementations of other kernels can be found in previous papers [Shahbahrami et al. 2006b, 2006c]. ...
Article
Full-text available
Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers form a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, it is shown how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, it is shown that special-purpose instructions (SPIs), such as the sum-of-absolute differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is not sufficient for quarter-pixel resolution nor for cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using, at most, 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
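To see why four extra bits per byte can remove the need for data-type conversion, the sketch below models lane-wise addition at a fixed lane width. It assumes unsigned lanes with wraparound (non-saturating) arithmetic and is only a model of the bit-width effect; the helper name `packed_add` is hypothetical and no MMX or MMMX instruction is reproduced.

```python
def packed_add(lanes_a, lanes_b, bits):
    """Add corresponding unsigned lanes, keeping only the low `bits` bits of each sum."""
    mask = (1 << bits) - 1
    return [(a + b) & mask for a, b in zip(lanes_a, lanes_b)]


# Hypothetical 8-bit pixel values, chosen only for illustration.
a = [200, 255]
b = [100, 255]

print(packed_add(a, b, 8))   # 8-bit lanes: sums above 255 wrap around
print(packed_add(a, b, 12))  # 12-bit lanes (8 + 4 extra bits): sums fit
```

With 8-bit lanes the data would first have to be unpacked to a wider type to keep the full sums; with 12-bit lanes the sum of two 8-bit values (at most 510) always fits.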