Figure 5 - uploaded by Ben H. H. Juurlink
The structure of the pmaddsd instruction.  


Source publication
Conference Paper
The 2D Discrete Wavelet Transform (DWT) is a time-consuming kernel in many multimedia applications such as JPEG2000 and MPEG-4. The 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. The vertical filtering is easy to vectorize (assuming row-major order), but to vectorize the horizontal filtering...
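The row-then-column structure described in the abstract can be illustrated with a minimal sketch. The abstract does not name a filter, so the unnormalised Haar step used here (`haar_1d`) is an assumption chosen only for brevity, and the helper names `transpose` and `dwt_2d` are hypothetical. The sketch is scalar and says nothing about how the authors vectorize either pass.

```python
def haar_1d(signal):
    """One level of an unnormalised Haar step: pairwise averages, then differences.

    Assumption: Haar is used only as a stand-in for whatever filter bank
    the source publication actually evaluates.
    """
    half = len(signal) // 2
    low = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    high = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*matrix)]


def dwt_2d(image):
    """One decomposition level: horizontal pass on rows, then vertical pass on columns."""
    # Horizontal filtering: every row is transformed independently.
    after_rows = [haar_1d(row) for row in image]
    # Vertical filtering: transform each column of the row-filtered image.
    after_columns = [haar_1d(column) for column in transpose(after_rows)]
    # Transpose back so the result is row-major again.
    return transpose(after_columns)


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
result = dwt_2d(image)
```

In row-major storage the vertical pass touches the same column position in consecutive rows, which is why the abstract calls it easy to vectorize; the horizontal pass combines neighbouring elements within one row.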

Context in source publication

Context 1
... an MAC unit that can perform a 32-bit single-precision floating-point multiplication with accumulation is a good solution to vectorize horizontal filtering of the convolution-based transform. As Figure 5 shows, multiplication of coefficients and input samples is possible without using overhead instructions and replication of coefficients. The pmaddsd (parallel multiply and add single-precision values to double-precision) performs an SIMD multiply of the four single-precision floating-point values in the source operand by the ... The MRF is extended to floating-point numbers using the SSE register file. ...
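As a rough illustration of why multiply-accumulate is the core operation of convolution-based horizontal filtering, the following scalar sketch computes each output sample as a running sum of coefficient-times-sample products. It does not model pmaddsd itself, whose exact operand semantics are cut off in the excerpt above; the function names `fir_mac` and `filter_rows` and the example coefficients are hypothetical.

```python
def fir_mac(samples, coeffs):
    """Filter one row with a multiply-accumulate loop.

    Each output is sum(coeffs[k] * samples[n + k]) over the filter taps.
    Only positions where the filter fully overlaps the row are produced;
    border extension is deliberately left out of this sketch.
    """
    taps = len(coeffs)
    output = []
    for n in range(len(samples) - taps + 1):
        accumulator = 0.0
        for k in range(taps):
            # One multiply-accumulate per coefficient/sample pair.
            accumulator += coeffs[k] * samples[n + k]
        output.append(accumulator)
    return output


def filter_rows(image, coeffs):
    """Horizontal filtering: apply the same coefficients to every row."""
    return [fir_mac(row, coeffs) for row in image]


smoothed = fir_mac([1, 2, 3, 4, 5], [0.5, 0.5])
differences = filter_rows([[1, 2, 3], [4, 5, 6]], [1.0, -1.0])
```

The inner loop applies the coefficients in index order without reversing them, which is sufficient here because the point is the accumulate pattern rather than a particular filter definition.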


Citations

... The fine-grained parallelization refers to exploiting the SIMD extensions (namely, MMX, and SSE). This kind was investigated at various levels in [32,34,29,35,27,39,36,40,41,13,42]. The most efficient solutions are presented in [13]. ...
Thesis
The thesis focuses on efficient computation of the two-dimensional discrete wavelet transform. The state-of-the-art methods are extended in several ways to perform the transform in a single loop, possibly in a multi-scale fashion, using a compact streaming core. This core can further be appropriately reorganized to target the minimization of certain platform resources. The approach presented here nicely fits into common SIMD extensions, exploits the cache hierarchy of modern general-purpose processors, and is suitable for parallel evaluation. Finally, the approach presented is incorporated into the JPEG 2000 compression chain, in which it has proven to be fundamentally faster than widely used implementations.
... It means that the last Huffman coding step in level l + 1 can run simultaneously with the prediction step in level l, both handled by different threads. Some parallel processing is also possible for JPEG2K for the DWT part [36] or the EBCOT module [37]. Recently, new parallel processing methods and advanced graphic processing units have also been applied to JPEG2K for a significant speedup in the execution time [38], [39]. ...
Article
This paper presents a new scalable locally-adaptive resolution, lossless low-complexity (LAR-LLC) image codec. It is based on the Locally Adaptive Resolution (LAR) framework, which is a multi-resolution compression method supporting both lossy and lossless coding. To achieve an efficient low-complexity solution, each processing stage of the LAR is modified. For the first step, consisting of a pyramidal decomposition, a new reversible transform called “Hierarchical Diagonal S Transform (HD-ST)” is proposed. The HD-ST operates on sets of data pairs, requiring only shift and add/sub operations. The second step performs the prediction of the transformed coefficients. The prediction scheme considers both inter- and intra-level information, and involves fixed weights. Then a classification process is introduced to separate prediction errors into subclasses, using a context modelling approach. Finally, each subclass is coded by the Huffman coding algorithm. Results of the lossless compression experiments showed that LAR-LLC achieves the same compression performance as JPEG2000 with a lower complexity.
... In addition, we have implemented the 2D DWT using SIMD instructions [12,25,78,80,81]. In a recently published paper [80], the SIMD implementations of the RCWT and LBWT have been compared to each other. ...
Article
The 2D Discrete Wavelet Transform (DWT) is an important function in many multimedia applications, such as the JPEG2000 and MPEG-4 standards, digital watermarking, and content-based multimedia information retrieval systems. The 2D DWT is more computationally intensive than other functions, for instance, in the JPEG2000 standard. Therefore, different architectures have been proposed to process the 2D DWT. The goal of this paper is to review and evaluate different algorithms and different kinds of architectures, such as application-specific integrated circuits, field-programmable gate arrays, digital signal processors, graphics processing units, and General-Purpose Processors (GPPs), that are used to process the 2D DWT. In addition, we implement the 2D DWT using different algorithms on GPPs enhanced with multimedia extensions. The experimental results show that the largest speedup of the vectorized 2D DWT over the scalar implementation is about 2.8 for first-level decomposition. Furthermore, the characteristics of the 2D DWT and the disadvantages of existing architectures, such as GPPs enhanced with SIMD instructions, are discussed.