The structure of the pmaddsd instruction.

Source publication

SIMD Architectural Enhancements to Improve the Performance of the 2D Discrete Wavelet Transform

Conference Paper

Aug 2009

The 2D Discrete Wavelet Transform (DWT) is a time-consuming kernel in many multimedia applications such as JPEG2000 and MPEG-4. The 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. The vertical filtering is easy to vectorize (assuming row-major order), but to vectorize the horizontal filtering...

Context 1

... an MAC unit that can perform a 32-bit single-precision floating-point multiplication with accumulation is a good solution to vectorize horizontal filtering of the convolution-based transform. As Figure 5 shows, multiplication of coefficients and input samples is possible without using overhead instructions and replication of coefficients. The pmaddsd (parallel multiply and add single-precision values to double-precision) performs an SIMD multiply of the four single-precision floating-point values in the source operand by the ... The MRF is extended to floating-point numbers using the SSE register file. ...
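The excerpt above is truncated, so the exact lane arrangement of pmaddsd is not recoverable from it. As a rough illustration only, the Python sketch below models one plausible reading, by analogy with the integer x86 instruction pmaddwd (which multiplies packed 16-bit words and adds adjacent products into 32-bit results): four single-precision products are formed and adjacent pairs are summed in double precision. The function names `to_f32` and `pmaddsd_sketch`, and the pairwise-add behaviour, are assumptions and are not taken from the paper.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def pmaddsd_sketch(samples, coeffs):
    """Hypothetical model of pmaddsd (assumed semantics, not from the paper).

    Multiplies four single-precision lanes element-wise, then adds adjacent
    products, keeping the two sums in double precision.
    """
    assert len(samples) == 4 and len(coeffs) == 4
    # Operands are single precision; the products are held as doubles.
    products = [to_f32(s) * to_f32(c) for s, c in zip(samples, coeffs)]
    return [products[0] + products[1], products[2] + products[3]]


# Four input samples times four filter coefficients, two accumulated results.
result = pmaddsd_sketch([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.25, 0.25])
```

Under this assumed model, one call covers two two-tap partial sums of a horizontal filter without replicating coefficients across lanes.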

A Massively Parallel FPGA-based Coprocessor for Support Vector Machines

Conference Paper

Jan 2009

We present a massively parallel FPGA-based coprocessor for Support Vector Machines (SVMs), a machine learning algorithm whose applications include recognition tasks such as learning scenes, situations and concepts, and reasoning tasks such as analyzing the recognized scenes and semantics. The coprocessor architecture, targeted at both SVM training...

Real-time multi-level wavelet lifting scheme on a fixed-point DSP for JPEG 2000 and scalable video coding

Conference Paper

Aug 2009

In this paper, we discuss the design and real-time implementation of a multi-level two-dimensional discrete wavelet transform (2D-DWT). The wavelet transform uses the well-known 5/3 filter coefficients and is implemented using the lifting framework. However, the transform allows complexity-scalable solutions with different latencies for scalable vi...
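For orientation, one level of a 1D 5/3 lifting step can be sketched as below. This is a minimal sketch of the reversible integer 5/3 variant used in JPEG 2000, assuming an even-length input and symmetric extension at the boundaries; it is not the paper's fixed-point DSP implementation, and the function names `lift53_forward` and `lift53_inverse` are made up for illustration.

```python
def lift53_forward(x):
    """One level of integer 5/3 lifting on an even-length list of ints.

    Returns (s, d): the low-pass (approximation) and high-pass (detail) halves.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    # Predict step: odd sample minus the floor-average of its even neighbours.
    d = []
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < n else x[2 * i]  # mirror at right edge
        d.append(x[2 * i + 1] - ((left + right) >> 1))
    # Update step: even sample plus a rounded quarter of the neighbouring details.
    s = []
    for i in range(half):
        prev = d[i - 1] if i > 0 else d[0]  # mirror at left edge
        s.append(x[2 * i] + ((prev + d[i] + 2) >> 2))
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the lifting steps in reverse order."""
    half = len(s)
    x = [0] * (2 * half)
    for i in range(half):
        prev = d[i - 1] if i > 0 else d[0]
        x[2 * i] = s[i] - ((prev + d[i] + 2) >> 2)
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if i + 1 < half else x[2 * i]
        x[2 * i + 1] = d[i] + ((left + right) >> 1)
    return x


signal = [10, 12, 14, 13, 9, 8, 7, 7]
low, high = lift53_forward(signal)
restored = lift53_inverse(low, high)
```

A 2D level would apply the same step along every row and then along every column; repeating it on the low-pass quadrant gives the multi-level transform.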

Fine-Grained Energy and Performance Profiling framework for Deep Convolutional Neural Networks

Preprint

May 2018

There is a huge demand for on-device execution of deep learning algorithms on mobile and embedded platforms. These devices present constraints on the application due to limited resources and power. Hence, developing energy-efficient solutions to address this issue will require innovation in algorithmic design, software and hardware. Such innovation...

Figure 1. Block diagram of the multiplier: Two 8-bit operands a and b...

Table 2. Energy consumption for RLNC encoding followed by RLNC decoding.

Figure 4. Block diagram of the matrix inversion: One 8-bit value is...

Hardware Acceleration for RLNC: A Case Study Based on the Xtensa Processor with the Tensilica Instruction-Set Extension

Article

Sep 2018

Random linear network coding (RLNC) can greatly aid data transmission in lossy wireless networks. However, RLNC requires computationally complex matrix multiplications and inversions in finite fields (Galois fields). These computations are highly demanding for energy-constrained mobile devices. The presented case study evaluates hardware accelerati...
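To make the finite-field arithmetic concrete, the sketch below multiplies two bytes in GF(2^8) by shift-and-XOR and uses that to form one coded packet as a linear combination of source packets. The reduction polynomial 0x11B (the one used by AES) is an assumption, since the abstract does not state which polynomial the case study uses; the names `gf256_mul` and `rlnc_encode` are hypothetical.

```python
def gf256_mul(a, b, poly=0x11B):
    """Multiply two bytes in GF(2^8) using shift-and-XOR.

    poly is the reduction polynomial; 0x11B is assumed here for illustration.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce back to 8 bits
        b >>= 1
    return result


def rlnc_encode(packets, coeffs):
    """Combine equal-length packets (lists of bytes) into one coded packet."""
    coded = [0] * len(packets[0])
    for packet, c in zip(packets, coeffs):
        for i, byte in enumerate(packet):
            coded[i] ^= gf256_mul(c, byte)
    return coded


product = gf256_mul(0x57, 0x83)
coded = rlnc_encode([[1, 2], [3, 4]], [1, 1])
```

Decoding inverts the coefficient matrix over the same field, which is where the matrix inversion mentioned in the abstract comes in.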

Lifting Scheme Cores for Wavelet Transform

Thesis

May 2016

David Bařina

The thesis focuses on efficient computation of the two-dimensional discrete wavelet transform. The state-of-the-art methods are extended in several ways to perform the transform in a single loop, possibly in a multi-scale fashion, using a compact streaming core. This core can further be appropriately reorganized to target the minimization of certain platform resources. The approach presented here nicely fits into common SIMD extensions, exploits the cache hierarchy of modern general-purpose processors, and is suitable for parallel evaluation. Finally, the approach presented is incorporated into the JPEG 2000 compression chain, in which it has proven to be fundamentally faster than widely used implementations.

LAR-LLC: A low-complexity multiresolution lossless image codec

Article

Jul 2015
IEEE T CIRC SYST VID

This paper presents a new scalable locally-adaptive resolution, lossless low-complexity (LAR-LLC) image codec. It is based on the Locally Adaptive Resolution (LAR) framework, which is a multi-resolution compression method supporting both lossy and lossless coding. To achieve an efficient low-complexity solution, each processing stage of the LAR is modified. For the first step, consisting of a pyramidal decomposition, a new reversible transform called “Hierarchical Diagonal S Transform (HD-ST)” is proposed. The HD-ST operates on sets of data pairs, requiring only shift and add/sub operations. The second step performs the prediction of the transformed coefficients. The prediction scheme considers both inter- and intra-level information, and involves fixed weights. Then a classification process is introduced to separate prediction errors into subclasses, using a context modelling approach. Finally, each subclass is coded by the Huffman coding algorithm. Results of the lossless compression experiments showed that LAR-LLC achieves the same compression performance as JPEG2000 with a lower complexity.

Algorithms and architectures for 2D discrete wavelet transform

Article

Nov 2012
J SUPERCOMPUT

Asadollah Shahbahrami

The 2D Discrete Wavelet Transform (DWT) is an important function in many multimedia applications, such as JPEG2000 and MPEG-4 standards, digital watermarking, and content-based multimedia information retrieval systems. The 2D DWT is more computationally intensive than other functions, for instance, in the JPEG2000 standard. Therefore, different architectures have been proposed to process the 2D DWT. The goal of this paper is to review and evaluate different algorithms and different kinds of architectures, such as application-specific integrated circuits, field-programmable gate arrays, digital signal processors, graphics processing units, and General-Purpose Processors (GPPs), that are used to process the 2D DWT. In addition, we implement the 2D DWT using different algorithms on GPPs enhanced with multimedia extensions. The experimental results show that the largest speedup of the vectorized 2D DWT over the scalar implementation is about 2.8 for first-level decomposition. Furthermore, the characteristics of the 2D DWT and the disadvantages of existing architectures, such as GPPs enhanced with SIMD instructions, are discussed.
