Fig. 4
DFT performance before and after SPL compiler optimizations on a SPARC and a MIPS architecture. SPARC: UltraSparc III, 750 MHz, Forte Developer 7 compiler, flags -fast -xO5; MIPS: MIPS R12000, 300 MHz, MIPSpro 7.3.1.1 compiler, flag -O3. (a) SPARC. (b) MIPS.


Source publication
Article
Full-text available
Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform,...

Contexts in source publication

Context 1
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line ...
Context 2
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line marked with triangles and labeled "Scalarized" shows that every formula is improved by scalarizing the C code before sending it to the native compiler on both platforms. Note that we performed our MIPS ...
Context 3
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line marked with triangles and labeled "Scalarized" shows that every formula is improved by scalarizing the C code before sending it to the native compiler on both platforms. Note that we performed our MIPS experiments on an R12000 with the MIPSpro ...
Context 4
... with the MIPSpro compiler. See [47] for a case where the same experiments were performed with the same compiler on an R10000, but with different results. In that case, the MIPSpro compiler already achieved good performance without scalarizing or optimizing the code first. The line marked with bullets and labeled "Optimized" in both graphs of Fig. 4 represents the performance of the DFT codes after the entire optimization following the strategy described above. We observe that the additional optimizations beyond array scalarization significantly improve the code on SPARC, but not on ...
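To make the scalarization step concrete, the following is a minimal hand-written C sketch (not actual SPL compiler output) of the transformation being measured: temporary arrays indexed by compile-time constants are replaced by scalar variables, which the native compiler can then keep in registers.

    /* Before: a compiler-generated temporary array. Even though all
       indices are compile-time constants, native compilers often keep
       t[] in memory instead of in registers. */
    void dft2_array(const double *x, double *y) {
        double t[2];
        t[0] = x[0] + x[1];
        t[1] = x[0] - x[1];
        y[0] = t[0];
        y[1] = t[1];
    }

    /* After scalarization: each array element becomes a scalar that
       the native compiler can register-allocate. */
    void dft2_scalar(const double *x, double *y) {
        double t0 = x[0] + x[1];
        double t1 = x[0] - x[1];
        y[0] = t0;
        y[1] = t1;
    }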


Citations

... In [16], [17], the authors put forth the celebrated Fastest Fourier Transform in the West (FFTW) C library for FFT implementation on multiprocessor computers. Other works, such as UHFFT [18] and SPIRAL [19], also contributed to FFT implementation on multiprocessor computers and DSP chips. What distinguishes our work from these is that our design considers the pruned FFT arising from compact IFDMA. ...
Article
Full-text available
Interleaved Frequency Division Multiple Access (IFDMA) has the salient advantage of lower Peak-to-Average Power Ratio (PAPR) than its competitors like Orthogonal FDMA (OFDMA). A recent research effort of ours put forth a new IFDMA transceiver design significantly less complex than conventional IFDMA transceivers. The new IFDMA transceiver design reduces the complexity by exploiting a certain correspondence between the IFDMA signal processing and the Cooley-Tukey IFFT/FFT algorithmic structure so that IFDMA streams can be inserted/extracted at different stages of an IFFT/FFT module according to the sizes of the streams. Although our prior work has laid down the theoretical foundation for the new IFDMA transceiver’s structure, the practical realization of the transceiver on specific hardware with resource constraints has not been carefully investigated. This paper is an attempt to fill the gap. Specifically, this paper puts forth a heuristic algorithm called multi-priority scheduling (MPS) to schedule the execution of the butterfly computations in the IFDMA transceiver with the constraint of a limited number of hardware processors. The resulting FFT computation, referred to as MPS-FFT, has a much lower computation time than conventional FFT computation when applied to the IFDMA signal processing. Importantly, we derive a lower bound for the optimal IFDMA FFT computation time to benchmark MPS-FFT. Our experimental results indicate that when the number of hardware processors is a power of two: 1) MPS-FFT has near-optimal computation time; 2) MPS-FFT incurs less than 44.13% of the computation time of the conventional pipelined FFT.
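For readers unfamiliar with the butterfly units that MPS-FFT schedules, the sketch below shows a textbook radix-2 Cooley-Tukey butterfly in C99; this is a generic illustration, not code from the cited paper. The scheduling problem described above is deciding the execution order of many such units on a limited number of hardware processors.

    #include <complex.h>

    /* One radix-2 butterfly: combine two values with twiddle factor w.
       An N-point radix-2 FFT performs (N/2)*log2(N) of these. */
    static void butterfly(double complex *a, double complex *b,
                          double complex w) {
        double complex t = w * (*b);
        *b = *a - t;
        *a = *a + t;
    }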
... Thus, for large stream graphs, embedded memories quickly become the bottleneck resource. Spiral [77] is a code generation system for digital signal processing (DSP) transforms that has been extended to generate DSP IP cores for FPGAs [78]. RIPL [79] employs image processing algorithmic skeletons as a general framework for the application of user-defined functions, generating efficient streaming hardware pipelines. ...
Article
Full-text available
Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High Level Synthesis technologies that are amenable to the domain, as well as new-generation Hardware Description Languages, and present our ongoing work on IMP-lang, a language for early stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration. Different tools, depending on the underlying language paradigm, produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation, allowing designers to partition their algorithms more efficiently, iterating toward a convergent design that can then be implemented across heterogeneous elements accordingly.
... where a_ij are the entries of A. We use these three classes of structures, and a combination of them, as a representative set that can be combined to form more complex relationships [16,20]. Our framework is not limited to only these structures, and they serve as an example to evaluate its performance. ...
Preprint
Full-text available
The performance of sparse matrix computation highly depends on the matching of the matrix format with the underlying structure of the data being computed on. Different sparse matrix formats are suitable for different structures of data. Therefore, the first challenge is identifying the matrix structure before the computation to match it with an appropriate data format. The second challenge is to avoid reading the entire dataset before classifying it. This can be done by identifying the matrix structure through samples and their features. Yet, it is possible that global features cannot be determined from a sampling set and must instead be inferred from local features. To address these challenges, we develop a framework that generates sparse matrix structure classifiers using graph convolutional networks. The framework can also be extended to other matrix structures using user-provided generators. The approach achieves 97% classification accuracy on a set of representative sparse matrix shapes.
... We distinguish between two different research topics in the field of auto-tuning research. Some auto-tuning research focuses on (1) auto-tuning compiler-generated code optimizations [6,7,8,9], whereas this paper focuses on (2) software autotuning [10,11]. Ashouri et al. [12] wrote an excellent survey on machine-learning methods for compiler-based auto-tuning. ...
Preprint
Full-text available
Graphic Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools, like Kernel Tuner, have been proposed. Unfortunately, these existing auto-tuning tools often do not concern themselves with integration of tuning results back into applications, which puts a significant implementation and maintenance burden on application developers. In this work, we present Kernel Launcher: an easy-to-use C++ library that simplifies the creation of highly-tuned CUDA applications. With Kernel Launcher, programmers can capture kernel launches, tune the captured kernels for different setups, and integrate the tuning results back into applications using runtime compilation. To showcase the applicability of Kernel Launcher, we consider a real-world computational fluid dynamics code and tune its kernels for different GPUs, input domains, and precisions.
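At its simplest, the software auto-tuning referenced above is an empirical sweep over a user-defined parameter space. The C sketch below illustrates the concept only; it does not use Kernel Tuner's or Kernel Launcher's actual APIs, and the tiled kernel and candidate sizes are hypothetical.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double data[N];
    static volatile double sink;   /* defeat dead-code elimination */

    /* Hypothetical tunable kernel: tiled summation over data[]. */
    static void kernel(int tile) {
        double sum = 0.0;
        for (int i = 0; i < N; i += tile)
            for (int j = i; j < i + tile && j < N; j++)
                sum += data[j];
        sink = sum;
    }

    /* Empirical search: run each candidate configuration, time it,
       and keep the fastest one. */
    static int autotune(const int *candidates, int n) {
        int best = candidates[0];
        double best_time = 1e30;
        for (int i = 0; i < n; i++) {
            clock_t t0 = clock();
            kernel(candidates[i]);
            double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = candidates[i];
            }
        }
        return best;
    }

    int main(void) {
        const int candidates[] = { 16, 64, 256, 1024 };
        printf("best tile size: %d\n", autotune(candidates, 4));
        return 0;
    }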
... The second category consists of DSLs such as Lift (Steuwer et al., 2015), Opt (DeVito et al., 2016), Halide (Ragan-Kelley et al., 2013), Diderot (Chiw et al., 2012), and OptiML (Sujeeth et al., 2011), which generate parallel code from their high-level programs. The third category is the DSLs that focus on generating efficient machine code for fixed-size linear algebra problems, such as Spiral (Puschel et al., 2005) and ...
Preprint
Full-text available
Automatic differentiation (AD) is a technique for computing the derivative of a function represented by a program. This technique is considered the de facto standard for computing differentiation in many machine learning and optimisation software tools. Despite the practicality of this technique, the performance of the differentiated programs, especially for functional languages and in the presence of vectors, is suboptimal. We present an AD system for a higher-order functional array-processing language. The core functional language underlying this system simultaneously supports both source-to-source forward-mode AD and global optimisations such as loop transformations. In combination, gradient computation with forward-mode AD can be as efficient as reverse mode, and the Jacobian matrices required for numerical algorithms such as Gauss-Newton and Levenberg-Marquardt can be efficiently computed.
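The forward mode mentioned in this abstract is easy to picture with dual numbers: every value carries its derivative, and each arithmetic operation propagates both. Below is a minimal generic C sketch of the idea, not the cited system's implementation.

    #include <stdio.h>

    /* Dual number: a value and its derivative w.r.t. one input. */
    typedef struct { double val, der; } Dual;

    static Dual dadd(Dual a, Dual b) {   /* sum rule */
        return (Dual){ a.val + b.val, a.der + b.der };
    }
    static Dual dmul(Dual a, Dual b) {   /* product rule */
        return (Dual){ a.val * b.val, a.der * b.val + a.val * b.der };
    }

    int main(void) {
        Dual x = { 3.0, 1.0 };            /* seed dx/dx = 1 */
        Dual y = dadd(dmul(x, x), x);     /* f(x) = x*x + x */
        printf("f(3) = %g, f'(3) = %g\n", y.val, y.der);  /* 12, 7 */
        return 0;
    }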
... Thus, for large stream graphs, embedded memories quickly become the bottleneck resource. Spiral [73] is a code generation system for digital signal processing (DSP) transforms that has been extended to generate DSP IP cores for FPGAs [74]. RIPL [75] employs image processing algorithmic skeletons as a general framework for the application of user-defined functions, generating efficient streaming hardware pipelines. ...
Preprint
Full-text available
Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High Level Synthesis technologies that are amenable to the domain, and present our ongoing work on IMP-Lang, a language for early stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration. Different tools, depending on the underlying language paradigm, produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation, allowing designers to partition their algorithms more efficiently, iterating toward a convergent design that can then be implemented across heterogeneous elements accordingly.
... Auto-tuners. Auto-tuners have proven to be a successful performance-tuning approach for increasingly diverse and complicated computer architectures, represented by ATLAS [71], FFTW [72], SPIRAL [73], and OSKI [74]. For sparse linear algebra, SMAT [15], clSpMV [19], TileSpMV [75], Naser Sedaghati et al. [76], and CSX [20] select the best artificial format and SpMV implementation for a given matrix, while IA-SpGEMM [77] selects the best formats for SpGEMM. ...
Preprint
Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed format(s) and implementation(s). AlphaSparse automatically creates novel machine-designed formats and SpMV kernel implementations entirely from the knowledge of input sparsity patterns and hardware architectures. Based on our proposed Operator Graph, which expresses the path of SpMV format and kernel design, AlphaSparse consists of three main components: Designer, Format & Kernel Generator, and Search Engine. It takes an arbitrary sparse matrix as input and outputs a performant machine-designed format and SpMV implementation. By extensively evaluating 843 matrices from the SuiteSparse Matrix Collection, AlphaSparse achieves significant performance improvement: 3.2× on average compared to five state-of-the-art artificial formats, and 1.5× on average (up to 2.7×) over the up-to-date implementation of the traditional auto-tuning philosophy.
... Optimizing numerical libraries with auto-tuning has a long history on CPUs. Auto-tuning has been applied to a broad spectrum of applications but is mainly grouped into two categories: compiler-generated code optimizations [10], [11] and user-defined code parameterization [3], [12]. This work adopts the latter as it is more generic. ...
... There is a vast literature on tensor and linear algebra systems in the compilers and HPC communities. However, most of them focus on dense data (e.g., [13,34,36,39,46,47]). Similarly, the database community studied Array DBMS [7,48] and SQL extensions for matrices (e.g., [31,41,56]), which are also primarily designed for dense data. ...
Preprint
Full-text available
Tensor programs often need to process large tensors (vectors, matrices, or higher order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems.
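As a concrete reference point for the storage mappings discussed above, the Compressed Sparse Row (CSR) format named in the abstract can be written as a C struct with the standard SpMV loop; this is a textbook sketch, not the cited system's internal representation.

    /* CSR: nonzeros of row i live at positions
       row_ptr[i] .. row_ptr[i+1]-1 of col_idx[] and vals[]. */
    typedef struct {
        int n_rows;
        const int *row_ptr;    /* length n_rows + 1 */
        const int *col_idx;    /* column index of each nonzero */
        const double *vals;    /* value of each nonzero */
    } CsrMatrix;

    /* y = A * x for a CSR matrix A. */
    void spmv_csr(const CsrMatrix *A, const double *x, double *y) {
        for (int i = 0; i < A->n_rows; i++) {
            double sum = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->vals[k] * x[A->col_idx[k]];
            y[i] = sum;
        }
    }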
... To the best of our knowledge, while these techniques are applicable off-the-shelf to the highly regular codes we generate, none implements the random vector packing strategies we present in Section 3, nor the fusion of reductions for better SIMD occupancy in such an automated system as MACVETH. The Spiral system illustrates very well the ability to mine for effective machine-specific SIMD implementations [16,24,33], albeit via an algebraic system to search for equivalent implementations. In contrast, we develop data-packing SIMD recipes that are tuned to the non-regular strided accesses in reconstructed loops, by first automatically characterizing their performance profile on the target machine, avoiding the limitations of static cost models. ...
Conference Paper
Sparse computations, such as sparse matrix-dense vector multiplication, are notoriously hard to optimize due to their irregularity and memory-boundedness. Solutions to improve the performance of sparse computations have been proposed, ranging from hardware-based ones such as gather-scatter instructions, to software ones such as generalized and dedicated sparse formats, used together with specialized executor programs for different hardware targets. These sparse computations are often performed on read-only sparse structures: while the data themselves are variable, the sparsity structure itself does not change. Indeed, sparse formats such as CSR have a typically high cost to insert/remove nonzero elements in the representation. The typical use case is to not modify the sparsity during possibly repeated computations on the same sparse structure. In this work, we exploit the possibility of generating a specialized executor program dedicated to the particular sparsity structure of an input matrix. It creates opportunities to remove indirection arrays and synthesize regular, vectorizable code for such computations. But, at the same time, it introduces challenges in code size and instruction generation, as well as in efficient SIMD vectorization. We present novel techniques and extensive experimental results to efficiently generate SIMD vector code for data-specific sparse computations, and study the limits, in terms of applicability and performance, of our techniques compared to state-of-practice high-performance libraries like Intel MKL.
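The specialized executors this paper describes can be pictured as follows: once the sparsity pattern is fixed and known, the generic CSR loop's run-time indirection through the column-index array disappears, leaving straight-line code with hard-coded indices. Below is a hand-written C sketch of the idea for a hypothetical 3x3 matrix with four nonzeros, not output of the paper's actual generator.

    /* Fixed pattern: nonzeros at (0,0), (0,2), (1,1), (2,0).
       v[] holds the nonzero values in that order. The indices are
       baked into the code, so no index arrays are read at run time,
       and the compiler sees regular, vectorizable straight-line code. */
    void spmv_specialized(const double v[4], const double x[3],
                          double y[3]) {
        y[0] = v[0] * x[0] + v[1] * x[2];
        y[1] = v[2] * x[1];
        y[2] = v[3] * x[0];
    }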