Fig. 4
DFT performance before and after SPL compiler optimizations on a SPARC and a MIPS architecture. SPARC: UltraSparc III, 750 MHz, Forte Developer 7 compiler, flags -fast -xO5; MIPS: MIPS R12000, 300 MHz, MIPSpro 7.3.1.1 compiler, flag -O3. (a) SPARC. (b) MIPS.


Source publication
Article
Full-text available
Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform,...

Contexts in source publication

Context 1
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line ...
Context 2
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line marked with triangles and labeled "Scalarized" shows that every formula is improved by scalarizing the C code before sending it to the native compiler on both platforms. Note that we performed our MIPS ...
Context 3
... of the optimizations. Merely scalarizing arrays provides a sizable performance benefit, as seen in Fig. 4. These graphs depict the execution time (lower is better) of the programs generated from 45 SPIRAL-generated formulas for a DFT on two different platforms. The line marked with stars and labeled "SPARC" in Fig. 4(a) [respectively, "MIPS" in Fig. 4(b)] shows the execution times achieved by the native SPARC (MIPS) compiler alone. The line marked with triangles and labeled "Scalarized" shows that every formula is improved by scalarizing the C code before sending it to the native compiler on both platforms. Note that we performed our MIPS experiments on an R12000 with the MIPSpro ...
Context 4
... with the MIPSpro compiler. See [47] for a case where the same experiments were performed with the same compiler on an R10000, but with different results. In that case, the MIPSpro compiler already achieved good performance without scalarizing or optimizing the code first. The line marked with bullets and labeled "Optimized" in both graphs of Fig. 4 represents the performance of the DFT codes after the entire optimization following the strategy described above. We observe that the additional optimizations beyond array scalarization significantly improve the code on SPARC, but not on ...
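To make the scalarization step concrete, the following is a minimal hand-written C sketch (not actual SPL compiler output) of the transformation being measured: temporary arrays indexed by compile-time constants are replaced by scalar variables, which the native compiler can then keep in registers.

    /* Before: a compiler-generated temporary array. Even though all
       indices are compile-time constants, native compilers often keep
       t[] in memory instead of in registers. */
    void dft2_array(const double *x, double *y) {
        double t[2];
        t[0] = x[0] + x[1];
        t[1] = x[0] - x[1];
        y[0] = t[0];
        y[1] = t[1];
    }

    /* After scalarization: each array element becomes a scalar that
       the native compiler can register-allocate. */
    void dft2_scalar(const double *x, double *y) {
        double t0 = x[0] + x[1];
        double t1 = x[0] - x[1];
        y[0] = t0;
        y[1] = t1;
    }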


Citations

... In [16], [17], the authors put forth the celebrated Fastest Fourier Transform in the West (FFTW) C library for FFT implementation on multiprocessor computers. Other works, such as UHFFT [18] and SPIRAL [19], also contributed to FFT implementation on multiprocessor computers and DSP chips. What distinguishes our work from these is that our design considers the pruned FFT arising from compact IFDMA. ...
Article
Full-text available
Interleaved Frequency Division Multiple Access (IFDMA) has the salient advantage of lower Peak-to-Average Power Ratio (PAPR) than its competitors like Orthogonal FDMA (OFDMA). A recent research effort of ours put forth a new IFDMA transceiver design significantly less complex than conventional IFDMA transceivers. The new IFDMA transceiver design reduces the complexity by exploiting a certain correspondence between the IFDMA signal processing and the Cooley-Tukey IFFT/FFT algorithmic structure so that IFDMA streams can be inserted/extracted at different stages of an IFFT/FFT module according to the sizes of the streams. Although our prior work has laid down the theoretical foundation for the new IFDMA transceiver’s structure, the practical realization of the transceiver on specific hardware with resource constraints has not been carefully investigated. This paper is an attempt to fill the gap. Specifically, this paper puts forth a heuristic algorithm called multi-priority scheduling (MPS) to schedule the execution of the butterfly computations in the IFDMA transceiver with the constraint of a limited number of hardware processors. The resulting FFT computation, referred to as MPS-FFT, has a much lower computation time than conventional FFT computation when applied to the IFDMA signal processing. Importantly, we derive a lower bound for the optimal IFDMA FFT computation time to benchmark MPS-FFT. Our experimental results indicate that when the number of hardware processors is a power of two: 1) MPS-FFT has near-optimal computation time; 2) MPS-FFT incurs less than 44.13% of the computation time of the conventional pipelined FFT.
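For readers unfamiliar with the butterfly units that MPS-FFT schedules, the sketch below shows a textbook radix-2 Cooley-Tukey butterfly in C99; this is a generic illustration, not code from the cited paper. The scheduling problem described above is deciding the execution order of many such units on a limited number of hardware processors.

    #include <complex.h>

    /* One radix-2 butterfly: combine two values with twiddle factor w.
       An N-point radix-2 FFT performs (N/2)*log2(N) of these. */
    static void butterfly(double complex *a, double complex *b,
                          double complex w) {
        double complex t = w * (*b);
        *b = *a - t;
        *a = *a + t;
    }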
... Thus, for large stream graphs, embedded memories quickly become the bottleneck resource. Spiral [77] is a code generation system for digital signal processing (DSP) transforms that has been extended to generate DSP IP cores for FPGAs [78]. RIPL [79] employs image processing algorithmic skeletons as a general framework for the application of user-defined functions, generating efficient streaming hardware pipelines. ...
Article
Full-text available
Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High Level Synthesis technologies that are amenable to the domain, as well as new-generation Hardware Description Languages, and present our ongoing work on IMP-lang, a language for early stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration. Different tools, depending on the underlying language paradigm, produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation, allowing designers to partition their algorithms more efficiently, iterating toward a convergent design that can then be implemented across heterogeneous elements accordingly.
... where a_ij are the entries of A. We use these three classes of structures, and a combination of them, as a representative set that can be combined to form more complex relationships [16,20]. Our framework is not limited to only these structures, and they serve as an example to evaluate its performance. ...
Preprint
Full-text available
The performance of sparse matrix computation highly depends on the matching of the matrix format with the underlying structure of the data being computed on. Different sparse matrix formats are suitable for different structures of data. Therefore, the first challenge is identifying the matrix structure before the computation to match it with an appropriate data format. The second challenge is to avoid reading the entire dataset before classifying it. This can be done by identifying the matrix structure through samples and their features. Yet, it is possible that global features cannot be determined from a sampling set and must instead be inferred from local features. To address these challenges, we develop a framework that generates sparse matrix structure classifiers using graph convolutional networks. The framework can also be extended to other matrix structures using user-provided generators. The approach achieves 97% classification accuracy on a set of representative sparse matrix shapes.
... We distinguish between two different research topics in the field of auto-tuning research. Some auto-tuning research focuses on (1) auto-tuning compiler-generated code optimizations [6,7,8,9], whereas this paper focuses on (2) software autotuning [10,11]. Ashouri et al. [12] wrote an excellent survey on machine-learning methods for compiler-based auto-tuning. ...
Preprint
Full-text available
Graphic Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools, like Kernel Tuner, have been proposed. Unfortunately, these existing auto-tuning tools often do not concern themselves with integration of tuning results back into applications, which puts a significant implementation and maintenance burden on application developers. In this work, we present Kernel Launcher: an easy-to-use C++ library that simplifies the creation of highly-tuned CUDA applications. With Kernel Launcher, programmers can capture kernel launches, tune the captured kernels for different setups, and integrate the tuning results back into applications using runtime compilation. To showcase the applicability of Kernel Launcher, we consider a real-world computational fluid dynamics code and tune its kernels for different GPUs, input domains, and precisions.
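At its simplest, the software auto-tuning referenced above is an empirical sweep over a user-defined parameter space. The C sketch below illustrates the concept only; it does not use Kernel Tuner's or Kernel Launcher's actual APIs, and the tiled kernel and candidate sizes are hypothetical.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double data[N];
    static volatile double sink;   /* defeat dead-code elimination */

    /* Hypothetical tunable kernel: tiled summation over data[]. */
    static void kernel(int tile) {
        double sum = 0.0;
        for (int i = 0; i < N; i += tile)
            for (int j = i; j < i + tile && j < N; j++)
                sum += data[j];
        sink = sum;
    }

    /* Empirical search: run each candidate configuration, time it,
       and keep the fastest one. */
    static int autotune(const int *candidates, int n) {
        int best = candidates[0];
        double best_time = 1e30;
        for (int i = 0; i < n; i++) {
            clock_t t0 = clock();
            kernel(candidates[i]);
            double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = candidates[i];
            }
        }
        return best;
    }

    int main(void) {
        const int candidates[] = { 16, 64, 256, 1024 };
        printf("best tile size: %d\n", autotune(candidates, 4));
        return 0;
    }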
... The second category consists of DSLs such as Lift (Steuwer et al., 2015), Opt (DeVito et al., 2016), Halide (Ragan-Kelley et al., 2013), Diderot (Chiw et al., 2012), and OptiML (Sujeeth et al., 2011), which generate parallel code from their high-level programs. The third category is the DSLs that focus on generating efficient machine code for fixed-size linear algebra problems, such as Spiral (Puschel et al., 2005) and ...
Preprint
Full-text available
Automatic differentiation (AD) is a technique for computing the derivative of a function represented by a program. This technique is considered the de facto standard for computing differentiation in many machine learning and optimisation software tools. Despite the practicality of this technique, the performance of the differentiated programs, especially for functional languages and in the presence of vectors, is suboptimal. We present an AD system for a higher-order functional array-processing language. The core functional language underlying this system simultaneously supports both source-to-source forward-mode AD and global optimisations such as loop transformations. In combination, gradient computation with forward-mode AD can be as efficient as reverse mode, and the Jacobian matrices required for numerical algorithms such as Gauss-Newton and Levenberg-Marquardt can be efficiently computed.
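The forward mode mentioned in this abstract is easy to picture with dual numbers: every value carries its derivative, and each arithmetic operation propagates both. Below is a minimal generic C sketch of the idea, not the cited system's implementation.

    #include <stdio.h>

    /* Dual number: a value and its derivative w.r.t. one input. */
    typedef struct { double val, der; } Dual;

    static Dual dadd(Dual a, Dual b) {   /* sum rule */
        return (Dual){ a.val + b.val, a.der + b.der };
    }
    static Dual dmul(Dual a, Dual b) {   /* product rule */
        return (Dual){ a.val * b.val, a.der * b.val + a.val * b.der };
    }

    int main(void) {
        Dual x = { 3.0, 1.0 };            /* seed dx/dx = 1 */
        Dual y = dadd(dmul(x, x), x);     /* f(x) = x*x + x */
        printf("f(3) = %g, f'(3) = %g\n", y.val, y.der);  /* 12, 7 */
        return 0;
    }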
... Thus, for large stream graphs, embedded memories quickly become the bottleneck resource. Spiral [73] is a code generation system for digital signal processing (DSP) transforms that has been extended to generate DSP IP cores for FPGAs [74]. RIPL [75] employs image processing algorithmic skeletons as a general framework for the application of user-defined functions, generating efficient streaming hardware pipelines. ...
Preprint
Full-text available
Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High Level Synthesis technologies that are amenable to the domain, and present our ongoing work on IMP-Lang, a language for early stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration. Different tools, depending on the underlying language paradigm, produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation, allowing designers to partition their algorithms more efficiently, iterating toward a convergent design that can then be implemented across heterogeneous elements accordingly.
... Auto-tuners. Auto-tuners have proven to be a successful performance-tuning approach for increasingly diverse and complicated computer architectures, represented by ATLAS [71], FFTW [72], SPIRAL [73], and OSKI [74]. For sparse linear algebra, SMAT [15], clSpMV [19], TileSpMV [75], Naser Sedaghati et al. [76], and CSX [20] select the best artificial format and SpMV implementation for a given matrix, while IA-SpGEMM [77] selects the best formats for SpGEMM. ...
Preprint
Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed format(s) and implementation(s). AlphaSparse automatically creates novel machine-designed formats and SpMV kernel implementations entirely from the knowledge of input sparsity patterns and hardware architectures. Based on our proposed Operator Graph, which expresses the path of SpMV format and kernel design, AlphaSparse consists of three main components: Designer, Format & Kernel Generator, and Search Engine. It takes an arbitrary sparse matrix as input and outputs a performant machine-designed format and SpMV implementation. By extensively evaluating 843 matrices from the SuiteSparse Matrix Collection, AlphaSparse achieves significant performance improvement: 3.2× on average compared to five state-of-the-art artificial formats, and 1.5× on average (up to 2.7×) over the up-to-date implementation of the traditional auto-tuning philosophy.
... Optimizing numerical libraries with auto-tuning has a long history on CPUs. Auto-tuning has been applied to a broad spectrum of applications but is mainly grouped into two categories: compiler-generated code optimizations [10], [11] and user-defined code parameterization [3], [12]. This work adopts the latter as it is more generic. ...
... There is a vast literature on tensor and linear algebra systems in the compilers and HPC communities. However, most of them focus on dense data (e.g., [13,34,36,39,46,47]). Similarly, the database community studied Array DBMS [7,48] and SQL extensions for matrices (e.g., [31,41,56]), which are also primarily designed for dense data. ...
Preprint
Full-text available
Tensor programs often need to process large tensors (vectors, matrices, or higher order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems.
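As a concrete reference point for the storage mappings discussed above, the Compressed Sparse Row (CSR) format named in the abstract can be written as a C struct with the standard SpMV loop; this is a textbook sketch, not the cited system's internal representation.

    /* CSR: nonzeros of row i live at positions
       row_ptr[i] .. row_ptr[i+1]-1 of col_idx[] and vals[]. */
    typedef struct {
        int n_rows;
        const int *row_ptr;    /* length n_rows + 1 */
        const int *col_idx;    /* column index of each nonzero */
        const double *vals;    /* value of each nonzero */
    } CsrMatrix;

    /* y = A * x for a CSR matrix A. */
    void spmv_csr(const CsrMatrix *A, const double *x, double *y) {
        for (int i = 0; i < A->n_rows; i++) {
            double sum = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->vals[k] * x[A->col_idx[k]];
            y[i] = sum;
        }
    }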
... To the best of our knowledge, while these techniques are applicable off-the-shelf to the highly regular codes we generate, none implements the random vector packing strategies we present in Section 3, nor the fusion of reductions for better SIMD occupancy in such an automated system as MACVETH. The Spiral system illustrates very well the ability to mine for effective machine-specific SIMD implementations [16,24,33], albeit via an algebraic system to search for equivalent implementations. In contrast, we develop data-packing SIMD recipes that are tuned to the non-regular strided accesses in reconstructed loops, by first automatically characterizing their performance profile on the target machine, avoiding the limitations of static cost models. ...
Conference Paper
Sparse computations, such as sparse matrix-dense vector multiplication, are notoriously hard to optimize due to their irregularity and memory-boundedness. Solutions to improve the performance of sparse computations have been proposed, ranging from hardware-based ones such as gather-scatter instructions, to software ones such as generalized and dedicated sparse formats, used together with specialized executor programs for different hardware targets. These sparse computations are often performed on read-only sparse structures: while the data themselves are variable, the sparsity structure itself does not change. Indeed, sparse formats such as CSR have a typically high cost to insert/remove nonzero elements in the representation. The typical use case is to not modify the sparsity during possibly repeated computations on the same sparse structure. In this work, we exploit the possibility of generating a specialized executor program dedicated to the particular sparsity structure of an input matrix. It creates opportunities to remove indirection arrays and synthesize regular, vectorizable code for such computations. But, at the same time, it introduces challenges in code size and instruction generation, as well as in efficient SIMD vectorization. We present novel techniques and extensive experimental results to efficiently generate SIMD vector code for data-specific sparse computations, and study the limits, in terms of applicability and performance, of our techniques compared to state-of-practice high-performance libraries like Intel MKL.
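The specialized executors this paper describes can be pictured as follows: once the sparsity pattern is fixed and known, the generic CSR loop's run-time indirection through the column-index array disappears, leaving straight-line code with hard-coded indices. Below is a hand-written C sketch of the idea for a hypothetical 3x3 matrix with four nonzeros, not output of the paper's actual generator.

    /* Fixed pattern: nonzeros at (0,0), (0,2), (1,1), (2,0).
       v[] holds the nonzero values in that order. The indices are
       baked into the code, so no index arrays are read at run time,
       and the compiler sees regular, vectorizable straight-line code. */
    void spmv_specialized(const double v[4], const double x[3],
                          double y[3]) {
        y[0] = v[0] * x[0] + v[1] * x[2];
        y[1] = v[2] * x[1];
        y[2] = v[3] * x[0];
    }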