Conference Paper

The Cache Performance and Optimizations of Blocked Algorithms

Authors:
Monica S. Lam, Edward E. Rothberg, Michael E. Wolf

Abstract

Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
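The abstract's core idea is easy to sketch in C. The following blocked matrix multiply is an illustrative sketch only (the block size B_SZ and row-major square-matrix layout are assumptions, not the paper's code); the paper's point is that the best block size depends on the matrix size and cache parameters, not merely on the cache capacity:

```c
#include <stddef.h>

/* Sketch of a blocked (tiled) matrix multiply C += A * B for n x n
 * row-major matrices. B_SZ is an illustrative block size; per the
 * paper, the best value depends on matrix size and cache parameters. */
#define B_SZ 32

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += B_SZ)
        for (size_t kk = 0; kk < n; kk += B_SZ)
            for (size_t jj = 0; jj < n; jj += B_SZ)
                /* Work on one submatrix at a time: data brought into
                 * the cache for this block is reused before eviction. */
                for (size_t i = ii; i < ii + B_SZ && i < n; i++)
                    for (size_t k = kk; k < kk + B_SZ && k < n; k++) {
                        double a = A[i * n + k];   /* reused across j */
                        for (size_t j = jj; j < jj + B_SZ && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```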


... As another example, Krylov-based least squares solvers can also be efficiently deployed [9], so long as matrix-vector and matrix-transpose-vector products can be efficiently computed. Unfortunately, if the system is so large that it cannot be stored in memory, then Krylov-based least squares solvers are also substantially slowed down by the memory movement costs of reading in the matrix multiple times per iteration [14]. ...
... When it comes to solving least squares problems, controlling the solver's accuracy depends on tracking the progress of the iterations and defining clear stopping conditions, which are typically achieved by using the norm of the gradient of the least squares subproblem. Unfortunately, the gradient of a large least squares problem is calculated by applying a very large matrix in both its original and transposed orientation to a vector, a procedure that is very costly because of its guaranteed violation of the principle of spatial locality for memory accesses [14] (excepting the case in which the matrix is symmetric). This issue is further exacerbated for a randomized solver: the gradient at the iterates of a randomized solver is never explicitly calculated, and, even if it were calculated occasionally for monitoring progress, it would be less reliable, as we now explain. ...
... Note, we allow m and n to be arbitrary, so our methodology applies to overdetermined, underdetermined, and rank-deficient systems. Owing to the size of A, we can only access A through matrix-vector multiplications; similarly, though we will not need it in our algorithm, we can access A ⊤ through matrix-vector multiplications, though this would be substantially more expensive owing to the needed memory access pattern [14]. For all other operations, we make use of efficiently computed sketches of A (see Footnote 1), which we individually denote by (possibly with a subscript) ...
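For context, the gradient computation these excerpts refer to is the standard least squares formula (stated here for the reader; it is not specific to the cited paper):

$$f(x) = \tfrac{1}{2}\,\lVert Ax - b\rVert_2^2, \qquad \nabla f(x) = A^{\top}(Ax - b),$$

so each gradient evaluation applies both $A$ and $A^{\top}$ to a vector; for a matrix stored row by row, the transposed product traverses memory with large strides, which is the cache-unfriendly access pattern attributed to [14].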
Preprint
Full-text available
As the scale of problems and data used for experimental design, signal processing, and data assimilation grows, the oft-occurring least squares subproblems are correspondingly growing in size. As the scale of these least squares problems creates prohibitive memory movement costs for the usual incremental QR and Krylov-based algorithms, randomized least squares solvers are garnering more attention. However, these randomized least squares solvers are difficult to integrate into application algorithms, as their uncertainty limits practical tracking of algorithmic progress and reliable stopping. Accordingly, in this work, we develop theoretically rigorous, practical tools for quantifying the uncertainty of an important class of iterative randomized least squares algorithms, which we then use to track algorithmic progress and create a stopping condition. We demonstrate the effectiveness of our algorithm by solving a 0.78 TB least squares subproblem from the inner loop of incremental 4D-Var using only 195 MB of memory.
... For example, they fail to handle conflict misses, and that leads to poor results for most practical cache organizations. Scientific loops can suffer heavily due to conflict misses [Hennessy and Patterson 1996; Lam et al. 1991; McKinley and Temam 1996; Temam et al. 1994], thereby precluding effective cache utilization. Conflict misses can be particularly significant in caches with low associativity. ...
... In such situations programmers often rely on time-consuming cache profiling and performance tuning [Lebeck and Wood 1994; Martonosi et al. 1992]. There has also been compiler work in tailoring code to reduce conflict misses [Bacon et al. 1994; Coleman and McKinley 1995; Lam et al. 1991; Rivera and Tseng 1998]. Unfortunately, conflict misses are highly sensitive to slight variations in problem size and base addresses [Bacon et al. 1994; Lam et al. 1991], and hence we need a more precise characterization to understand the underlying cause behind such conflict misses. ...
Thesis
Full-text available
The large latency of accessing main memory in modern computer systems is a primary obstacle to improving system performance. With the ever-widening performance gap between processors and main memory, the use of cache memory to bridge this gap is becoming more and more significant. Caches work well for programs that exhibit sufficient locality. For many programs, however, reference patterns fail to exploit the cache, thereby suffering heavily from high memory latency. Program transformations or source-code changes can radically alter memory access patterns, significantly improving cache performance for many programs. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve cache utilization. Unfortunately, cache conflicts are difficult to predict and estimate. Hence, effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This dissertation describes methods for generating, solving, and using Cache Miss Equations (CMEs) that give a detailed representation of cache behavior in loop-oriented scientific code. This thesis argues that the simple, precise characterization of cache misses, embodied in the CMEs, allows one to better understand the cause behind such misses, and helps reduce them in a methodical way. Our approach is implemented within the SUIF compiler framework, and extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, here each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as padding or loop tiling. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. We have demonstrated how the CME-based optimizations can be more effective at reducing cache misses than other non-CME-based optimizations previously described in the compiler community. This dissertation also presents an efficient and effective compiler framework, based on CMEs, that is used to drive automated diagnosis and selection of cache optimizing transformations. This diagnosis framework is driven by a CME Table, a table of CME-solution counts. We demonstrate how the CME Table is used to precisely diagnose poor cache behavior. It is used to select appropriate program transformations including loop permutation, array padding, and loop tiling. Even though these transformations are not new, there currently is no direct way to decide which one of them or a sequence of them would possibly help alleviate the bad cache performance of a given loop nest. This dissertation provides techniques to directly suggest effective transformations based on symptoms of poor cache behavior. Overall, the CMEs represent a framework for analyzing detailed cache behavior that offers the generality and precision needed for effective compiler optimizations through fine-grain cause-effect analysis.
... In addition, results are accumulated in registers for as long as possible, which reduces the number of times data is stored into the cache memory. Loop unrolling can be combined with prefetching to process as many elements as the cache line contains [33][34][35][36]. Listing 1 shows the solution steps of the matrix multiplication algorithm with unrolling. ...
... Hence, both B and C are reused Nb times each time the data are brought in. However, performance decreases if the ratio of memory fetches to numerical operations increases [33]. ...
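As a rough illustration of the register-accumulation and unrolling technique these excerpts describe (a sketch only, not the cited Listing 1; the function name, unroll factor, and divisibility assumption are illustrative):

```c
#include <stddef.h>

/* Sketch: row-vector times matrix, c = a * B, with the inner loop
 * unrolled by 4 and the running sums kept in scalar variables so the
 * compiler can hold them in registers instead of re-storing partial
 * results to memory each iteration. Assumes n % 4 == 0 for brevity. */
void row_times_matrix(size_t n, const double *a, const double *B, double *c)
{
    for (size_t j = 0; j < n; j += 4) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t k = 0; k < n; k++) {
            double ak = a[k];                 /* loaded once, reused 4x */
            s0 += ak * B[k * n + j];
            s1 += ak * B[k * n + j + 1];
            s2 += ak * B[k * n + j + 2];
            s3 += ak * B[k * n + j + 3];
        }
        c[j] = s0; c[j + 1] = s1; c[j + 2] = s2; c[j + 3] = s3;
    }
}
```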
Article
Full-text available
This paper focuses on Intel Advanced Vector Extensions (AVX), born of modern developments in AMD processors and in Intel itself. This instruction set processes chunks of data both individually and together. AVX supports a variety of applications such as image processing. Our goal is to accelerate and optimize square single-precision matrix multiplication for large matrix sizes, ranging from 2080 to 4512. Our optimization is designed using AVX instruction sets, OpenMP parallelization, and memory access optimization to overcome bandwidth limitations. This paper differs from others by concentrating on several main techniques and the results therein. Guidelines for parallel implementation of the said algorithms are presented, where the target architecture's characteristics need to be taken into consideration when the algorithms are applied. This work includes a comparative study of the most popular compilers: Intel C++ Compiler 17.0 versus the Microsoft Visual Studio C++ compiler 2015. Additionally, a comparative study between single-core and multicore platforms is examined. The proposed optimized algorithms achieve performance improvements of 71%, 59%, and 56% for C = A.B, C = A.BT, and C = AT.B, respectively, compared with results achieved with the latest Intel Math Kernel Library 2017 SGEMV subroutines.
... To exploit data locality, we must use data before it gets evicted from the cache; ideally, data is not loaded into the cache more than once. Caching and predicting reuse is a complex process [16]; empirically, reuse does occur within sufficiently small iteration spaces. We therefore contrive small iteration spaces by partitioning the original space into smaller tiles (Figure 2.1). ...
... We wanted to experiment with tile sizes optimal under time-tiling, as they are not easily predicted from those optimal for spatial tiling or from the cache size [16]. Therefore, the extended Devito auto-tuner (Section 4.4) was used to explore as many plausible tile-size combinations as possible. ...
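For readers unfamiliar with tiling, below is a minimal sketch of spatial tiling for a 2D 5-point stencil (tile sizes TI and TJ are illustrative assumptions; Devito's time-tiling is more involved, additionally skewing tiles across the time dimension):

```c
#include <stddef.h>

#define TI 64    /* illustrative tile extents */
#define TJ 256

/* Sketch: one spatially tiled sweep of a 2D 5-point stencil on an
 * n x n row-major grid. Tiling partitions the iteration space so that
 * each tile's working set can stay cache-resident while it is used. */
void stencil2d_tiled(size_t n, const double *u, double *v)
{
    for (size_t ii = 1; ii + 1 < n; ii += TI)
        for (size_t jj = 1; jj + 1 < n; jj += TJ)
            for (size_t i = ii; i < ii + TI && i + 1 < n; i++)
                for (size_t j = jj; j < jj + TJ && j + 1 < n; j++)
                    v[i * n + j] = 0.25 * (u[(i - 1) * n + j] +
                                           u[(i + 1) * n + j] +
                                           u[i * n + j - 1] +
                                           u[i * n + j + 1]);
}
```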
Preprint
Finite-difference methods are widely used in solving partial differential equations. In a large problem set, approximations can take days or weeks to evaluate, yet the bulk of computation may occur within a single loop nest. The modelling process for researchers is not straightforward either, requiring models with differential equations to be translated into stencil kernels, then optimised separately. One tool that seeks to speed up and eliminate mistakes from this tedious procedure is Devito, used to efficiently employ finite-difference methods. In this work, we implement time-tiling, a loop nest optimisation, in Devito yielding a decrease in runtime of up to 45%, and at least 20% across stencils from the acoustic wave equation family, widely used in Devito's target domain of seismic imaging. We present an estimator for arithmetic intensity under time-tiling and a model to predict runtime improvements in stencil computations. We also consider generalisation of time-tiling to imperfect loop nests, a less widely studied problem.
... Moreover, heuristics are also used to find optimized code execution orders and system-level optimization parameters (Lam et al. 1991). For example, the dynamic programming heuristic can be used to find the optimal code execution order for a program. ...
Thesis
Full-text available
Automatic code optimization is a constantly evolving field that aims to improve the performance of computer programs through automatic optimizations. The goal of our project was to integrate the Tiramisu autoscheduler into LPC, a Python compiler used for particle physics simulation. We used Tiramisu as a backend for LPC in order to automatically optimize the generated code. Our main objective was to integrate the autoscheduler into LPC while handling the problems specific to LQCD applications. We developed heuristics and fixes to improve the compatibility and scalability of the system. The end result is a coherent and efficient compilation environment for LQCD application developers, allowing them to automatically generate optimized code tailored to their specific needs. Thanks to Tiramisu's new autoscheduler, the code generated by LPC is automatically optimized, improving the performance of LQCD applications and resolving the scalability problems. Our approach has given LQCD application developers an efficient compilation environment, opening new perspectives in particle physics.
... Ankita et al. [15] proposed a shallow underwater image enhancement algorithm called Shallow-UWnet, which employs dense connections to three connected convolutional networks and generates enhanced underwater images using the final convolution layer with three kernels. However, we believe that the structure of Shallow-UWnet can be simplified, and the use of dense connections can lead to excessively large memory access cost (MAC) [34][35][36][37]. Models that have lower MAC are typically preferred, because they necessitate less memory access, resulting in lower power and computational resource consumption. ...
Article
Full-text available
In this paper, we aim to design a lightweight underwater image enhancement algorithm that can effectively solve the problem of color distortion and low contrast in underwater images. Recently, enhancement methods typically optimize a perceptual loss function, using high-level features extracted from pre-trained networks to train a feed-forward network for image enhancement tasks. This loss function measures the perceptual and semantic differences between images, but it is applied globally across the entire image and does not consider semantic information within the image, which limits the effectiveness of the perceptual loss. Therefore, we propose an area contrast distribution loss (ACDL), which trains a flow model to achieve real-time optimization of the difference between output and reference in training. Additionally, we propose a novel lightweight neural network. Because underwater image acquisition is difficult, our experiments have shown that our model training can use only half the amount of data and half the image size compared to Shallow-UWnet. The RepNet network reduces the parameter size by at least 48% compared to previous algorithms, and the inference time is 5 times faster than before. After incorporating ACDL, SSIM increased by 2.70% and PSNR increased by 9.72%.
... The structure of the kernel is a set of nested for loops. We have applied loop blocking [7]. By carefully choosing the block-size, we can fine-tune the kernel parallelism to match the available resources of the FPGA. ...
Preprint
With their widespread availability, FPGA-based accelerator cards have become an alternative to GPUs and CPUs to accelerate computing in applications with certain requirements (like energy efficiency) or properties (like fixed-point computations). In this paper we show results and experiences from mapping an industrial application used for drug discovery on several types of accelerators. We especially highlight the effort versus benefit of FPGAs compared to CPUs and GPUs in terms of performance and energy efficiency. For this application, even with extensive use of FPGA-specific features, and performing different optimizations, results on GPUs are still better, both in terms of energy and performance.
... Loop tiling (also known as loop blocking or nesting) is a well-known technique for optimizing the locality of reference [92], especially on large, twodimensional data sets. Figure 10.1 conceptually illustrates tiling of skeleton lineages. ...
... Edge computing plays a big role in reducing networking pressure and service response time, improving the performance of low-power devices [15]. The authors of [16] introduce the response time as one of the attributes that play a role in users' satisfaction. Bangla chatbot uses a knowledge-based tree structure to reduce steps of finding an answer in the whole conversation [17]. ...
Article
Full-text available
Chatbot technologies have made our lives easier. To create a chatbot with high intelligence, a significant amount of knowledge processing is required. However, this can slow down the reaction time; hence, a mechanism to enable a quick response is needed. This paper proposes a cache mechanism to improve the response time of the chatbot service; while the cache in CPU utilizes the locality of references within binary code executions, our cache mechanism for chatbots uses the frequency and relevance information which potentially exists within the set of Q&A pairs. The proposed idea is to enable the broker in a multi-layered structure to analyze and store the keyword-wise relevance of the set of Q&A pairs from chatbots. In addition, the cache mechanism accumulates the frequency of the input questions by monitoring the conversation history. When a cache miss occurs, the broker selects a chatbot according to the frequency and relevance, and then delivers the query to the selected chatbot to obtain a response for answer. This mechanism showed a significant increase in the cache hit ratio as well as an improvement in the average response time.
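A minimal sketch of the frequency-tracking part of such a cache, assuming a simple linear scan and illustrative structure and function names (the paper's mechanism additionally weighs keyword relevance when selecting a chatbot on a miss):

```c
#include <string.h>

#define CACHE_SLOTS 128

/* Sketch: each cached Q&A pair carries a hit count; lookups are
 * answered from the cache when the question matches. Illustrative
 * only -- the cited broker also uses keyword-wise relevance. */
struct qa_entry {
    char question[256];
    char answer[1024];
    unsigned long freq;   /* how often this question was asked */
};

static struct qa_entry cache[CACHE_SLOTS];
static size_t cache_used;

const char *cache_lookup(const char *question)
{
    for (size_t i = 0; i < cache_used; i++)
        if (strcmp(cache[i].question, question) == 0) {
            cache[i].freq++;      /* cache hit: bump the frequency */
            return cache[i].answer;
        }
    return NULL;  /* miss: the broker forwards the query to a chatbot */
}
```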
... Pipelining of CNN pipelines is widespread [13], [25]; our contribution is finding the pipeline partitions that optimize offchip transfers. Tiling [9], [20], [26], [39], [40] is a well-explored area of research. Occam's optimal tile shape targets convolutional reuse. ...
Preprint
Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. CNNs' large data at each layer's input, filters, and output poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse, full reuse implies that the initial input image and filters are read once from off chip and the final output is written once off chip without spilling the intermediate layers' data to off-chip. We propose Occam to capture full reuse via four contributions. (1) We identify the necessary condition for full reuse. (2) We identify the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. (3) Because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. Occam's partitions reside on different chips forming a pipeline so that a partition's filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). (4) Because the optimal partitions may result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAP) which replicates the bottleneck stages to improve throughput by staggering the mini-batches across the replicas. Importantly, STAP achieves balanced pipelines without changing Occam's optimal partitioning. Our simulations show that Occam cuts off-chip transfers by 21x and achieves 2.06x and 1.36x better performance, and 33% and 24% better energy than the base case and Layer Fusion, respectively. On an FPGA implementation, Occam performs 5.1x better than the base case.
... Optimizing the reuse factor has been the subject of extensive work. The original approaches concentrated on the performance improvements achievable through tiling. For matrix multiplication, a performance improvement by a factor of 3 to 4.3 was reported by Lam [321]. Further improvements result as the gap between processor and memory speeds grows. Tiling can also be used to reduce the energy consumption of memory systems [103]. ...
Book
Full-text available
A unique feature of this open-access textbook is its comprehensive introduction to the fundamentals of embedded systems, with applications in cyber-physical systems and the Internet of Things. It begins with an introduction to the field and an overview of specification models and languages for embedded and cyber-physical systems. It gives a brief overview of the hardware devices used for such systems and presents the basics of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and gives an overview of techniques for mapping applications onto execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints, so the book also contains a selected set of optimization techniques, with an emphasis on software optimization techniques. The book concludes with a brief survey of testing. The fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of Things (IoT), the evolution from single-core to multi-core processors, and the growing importance of energy efficiency and thermal issues. Contents: Introduction - Specification and Modeling - Embedded Systems Hardware - System Software - Evaluation and Validation - Mapping of Applications (Scheduling) - Optimization - Testing. The author: Prof. Dr. Peter Marwedel received a Dr. rer. nat. in physics and a Dr. habil. in computer science from the University of Kiel. From 1989 to 2014 he headed the Chair of Computer Engineering and Embedded Systems at the Faculty of Computer Science of TU Dortmund. His research interest is the efficient realization of embedded and cyber-physical systems. He co-initiated the collaborative research center SFB 876 on resource-efficient analysis of large data sets and served as its first deputy chairman.
... Optimizing the reuse factor has been the subject of extensive work. The original approaches concentrated on the performance improvements achievable through tiling. For matrix multiplication, a performance improvement by a factor of 3 to 4.3 was reported by Lam [321]. Further improvements result as the gap between processor and memory speeds grows. Tiling can also be used to reduce the energy consumption of memory systems [103]. ...
Chapter
Full-text available
Abstract: To master the complexity of designing applications for embedded systems, the reuse of components is an important tool.
... Utilization of blocked matrices or matrix blocking, which is the second strategy we adopt, is well known as one of the sound techniques for enhancing the performance of dense matrix multiplication, since it helps increase the cache hit ratio during the multiplication process [20]. The major question here is how to determine the block size with which target matrices subjected to multiplication are decomposed. ...
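One common starting heuristic, shown below as an assumption rather than the cited paper's method, sizes a square block so that roughly three blocks (one each of A, B, and C) fit in the cache; the paper at the top of this page shows that conflict misses make the truly optimal size depend on the matrix dimensions as well:

```c
#include <math.h>
#include <stddef.h>

/* Heuristic sketch (an assumption, not the cited paper's method):
 * choose a square block size b such that three b x b blocks of
 * doubles -- one each for A, B, and C -- fit in a cache of
 * cache_bytes bytes: 3 * b * b * sizeof(double) <= cache_bytes. */
size_t pick_block_size(size_t cache_bytes)
{
    return (size_t)sqrt((double)cache_bytes / (3.0 * sizeof(double)));
}
```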
Article
Full-text available
The non-equilibrium Green’s function (NEGF) is being utilized in the field of nanoscience to predict transport behaviors of electronic devices. This work explores how much performance improvement can be driven for quantum transport simulations with the aid of manycore computing, where the core numerical operation involves a recursive process of matrix multiplication. Major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details on how they are applied to optimize the performance of simulations in computing hardware, including Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general purpose graphic processing unit (GPU) devices. With a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of optimization techniques on the performance of simulations are rigorously tested in a KNL node equipped with two Quadro GV100 GPU devices, and we observe that computation is accelerated by a factor of up to ∼20 against the unoptimized case. The feasibility of handling large-scale workloads in a huge computing environment is also examined with nanowire simulations in a wide energy range, where good scalability is procured up to 2048 KNL nodes.
... Paving the way from legacy systems toward Fifth Generation (5G) telecom requires both evolutionary and revolutionary mobile system design, while taking into account the reduction of upgrade and migration costs [74,75,76]. User equipment then seamlessly draws on the benefits of the underlying 5G mobility architecture. ...
Article
Full-text available
Poised for smart-citizenry engagement, an unprecedented deluge of high-quality streaming services creates a major data traffic challenge for Fourth Generation (4G) bandwidth and coverage, and the expectations of the upcoming smart city cannot be ignored. The bottlenecks in exploiting benefits such as e-Life at the Mobile Equipment Provider (MEP) tiers are complemented by the Device Dependent Device Independent (D3I) configurations inherent at the various tiers of Mobile Service Providers (MSP). Enabling Supply Chain Management (SCM) augments a unique system of support involving the MSPs and MEPs for the desired Customer Relationship Management (CRM) and Ad hoc Resource Planning (ARP), which we found prevalent in the migration scenario from 4G to 5G technology deployment. Despite its complexity, both in terms of one-to-many and many-to-one relations across diverse MSP and MEP options, SCM's operational objectives set forth a unique challenge, and they are the main objective of the work presented here. In this paper, we present a framework to enhance the 4G legacy in mobile service provider capacity for a smart-city Machine-to-Machine (M2M) backbone. The migration process is assessed with the proposed strategic, technical, and operational indicators, which demonstrate its adaptability and flexibility when integrated into conventional 4G deployments, especially taking radio devices and applications into account. Web-enabled Software Defined Radio devices and applications are used to index the migration cost and to support SCM planning and execution. We identify that decentralization of the mobile service providers' infrastructure plays a major role in reducing the embedded complexity, which often appears as the primary bottleneck. With the MSP as a key player in the elasticity of migration, we present a platform to support both large and small MEP-MSP co-deployments. Pareto multi-criteria optimization is used to find the strategic indicators that serve as the primary Transformation Steering Factors (TSF), valid in both device-dependent and device-independent M2M migration. We present our results for achieving the TSF while rolling out interoperability and reconfiguration of devices deployed in typically volatile inter-MSP or intra-MSP tiers. Pareto Migration Indicators (MI) are optimized successively, progressing across the transformation schemes relative to baseline MSP services, hence enabling a lucrative choice while the elasticity of provider-centric cost depends adaptively on the technology legacy and the M2M access of User Equipment (UE).
... Device interoperability is the scheduling equivalent of an extended basic UE block, similar to cache blocks at the L1 or L2 level of embedded systems, as mentioned in Lam et al. (1991). It is a cell region that has a single entry point and zero or more exit edges leading to other device interoperability trees in terms of SDR function utilization. ...
Article
Full-text available
Despite the ever-expanding benefits of Fourth Generation (4G) telecom, such as e-Life from Etisalat-UAE, vendors' economic limitations in providing high-quality streaming services within limited bandwidth and coverage expectations cannot be ignored. A unique system of support involving Customer Relationship Management, Resource Planning, and Supply Chains exists in both Mobile Service Providers and Mobile Equipment Providers. Despite its complexity, the operational objectives of SCM planning may vary greatly from MSP to MSP and from MEP to MEP, and they are the main objective of the work presented here. In this paper, we explore the strategic, technical, and operational indicators while migrating legacy 4G MSP deployments to 5G to meet e-governance needs.
... Observing this, many prior research papers considered efficient management of data movements across various layers in the cache-memory-storage hierarchy. The proposed techniques include data locality optimizations in software [17,31,36,37,39,56,61,64] and hardware [13,28,33,46,59,60,70], careful design and management of cache and memory hierarchies [13,54,69,71], as well as proposals oriented towards taking advantage of memory-level parallelism [38,43,53,58]. ...
Article
Full-text available
The cost of moving data between compute elements and storage elements plays a significant role in shaping the overall performance of applications. We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called 'recomputation', in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. Experimental results reveal that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly.
... Observing this, many prior research papers considered efficient management of data movements across various layers in the cache-memory-storage hierarchy. The proposed techniques include data locality optimizations in software [17,31,36,37,39,56,61,64] and hardware [13,28,33,46,59,60,70], careful design and management of cache and memory hierarchies [13,54,69,71], as well as proposals oriented towards taking advantage of memory-level parallelism [38,43,53,58]. ...
Conference Paper
Full-text available
The cost of moving data between compute elements and storage elements plays a significant role in shaping the overall performance of applications. We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation", in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. Experimental results reveal that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly.
... This is especially the case in floating-point operations. For example, when performing a matrix multiplication, active matrix row and column values (or a subset of them) are typically stored in floating-point registers (Lam et al. 1991). Therefore, the DPU-REGBANK, which is typically involved in more immediate computations, shows significantly shorter error manifestation time than the DPU-FREGBANK. ...
Article
Full-text available
The Arm Triple Core Lock-Step (TCLS) architecture is the natural evolution of Arm Cortex-R Dual Core Lock-Step (DCLS) processors to increase dependability, predictability, and availability in safety-critical and ultra-reliable applications. TCLS is simple, scalable, and easy to deploy in applications where Arm DCLS processors are widely used (e.g., automotive), as well as in new sectors where the presence of Arm technology is incipient (e.g., enterprise) or almost non-existent (e.g., space). Specifically in space, COTS Arm processors provide optimal power-to-performance, extensibility, evolvability, software availability, and ease of use, especially in comparison with the decades old rad-hard computing solutions that are still in use. This article discusses the fundamentals of an Arm Cortex-R5 based TCLS processor, providing key functioning and implementation details. The article shows that the TCLS architecture keeps the use of rad-hard technology to a minimum, namely, using rad-hard by design standard cell libraries only to protect the critical parts that account for less than 4% of the entire TCLS solution. Moreover, when exposure to radiation is relatively low, such as in terrestrial applications or even satellites operating in Low Earth Orbits (LEO), the system could be implemented entirely using commercial cell libraries, relying on the radiation mitigation methods implemented on the TCLS to cope with sporadic soft errors in its critical parts. The TCLS solution allows thus to significantly reduce chip manufacturing costs and keep pace with advances in low power consumption and high density integration by leveraging commercial semiconductor processes, while matching the reliability levels and improving availability that can be achieved using extremely expensive rad-hard semiconductor processes. Finally, the article describes a TRL4 proof-of-concept TCLS-based System-on-Chip (SoC) that has been prototyped and tested to power the computer on-board an Airbus Defence and Space telecom satellite. When compared to the currently used processor solution by Airbus, the TCLS-based SoC results in a more than 5× performance increase and cuts power consumption by more than half.
... Observing this, many prior research papers considered efficient management of data movements across various layers in the cache-memory-storage hierarchy. The proposed techniques include data locality optimizations in software [17,31,36,37,39,56,61,64] and hardware [13,28,33,46,59,60,70], careful design and management of cache and memory hierarchies [13,54,69,71], as well as proposals oriented towards taking advantage of memory-level parallelism [38,43,53,58]. ...
Article
Full-text available
One cost that plays a significant role in shaping the overall performance of both single-threaded and multi-thread applications in modern computing systems is the cost of moving data between compute elements and storage elements. Traditional approaches to address this cost are code and data layout reorganizations and various hardware enhancements. More recently, an alternative paradigm, called Near Data Computing (NDC) or Near Data Processing (NDP), has been shown to be effective in reducing the data movements costs, by moving computation to data, instead of the traditional approach of moving data to computation. Unfortunately, the existing Near Data Computing proposals require significant modifications to hardware and are yet to be widely adopted. In this paper, we present a software-only (compiler-driven) approach to reducing data movement costs in both single-threaded and multi-threaded applications. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation," in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. If implemented carefully, CND can successfully trade off data access with computation, and considering the continuously increasing latency gap between the two, doing so can significantly reduce the execution latencies of both sequential and parallel application programs. We i) quantify the intrinsic recomputability of a set of single-threaded and multi-threaded applications, ii) propose a practical, compiler-driven approach that automatically transforms a given application code fragment to a version that employs recomputation, iii) discuss an optimization strategy that increases recomputability; and iv) compare CND, both qualitatively and quantitatively, against NDC. Our experimental analysis of CND reveals that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly. As a result, our compiler-driven approach with the proposed enhancements brings an average execution time improvement of 40.1%.
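The recomputation trade-off the CND papers above describe can be sketched in a few lines (illustrative only, not the authors' compiler transformation; the array names are assumptions):

```c
#include <stddef.h>

/* Sketch of the recomputation idea: suppose s[i] = x[i] + y[i] was
 * computed and stored earlier. If s[i] is likely to miss in cache
 * while x[i] and y[i] are still cache-resident, recomputing the sum
 * can be cheaper than the single costly load of s[i]. */
double use_stored(const double *s, size_t i)
{
    return 2.0 * s[i];            /* one potentially costly access  */
}

double use_recomputed(const double *x, const double *y, size_t i)
{
    return 2.0 * (x[i] + y[i]);   /* two cheaper accesses + an add  */
}
```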
... Based on how they affect the number of cache misses and the incurred penalty, these techniques fall into one of the following three categories: - eliminate cache misses by increasing spatial and temporal locality. Locality is increased by (a) eliminating indirection with cache-conscious data structure designs like the CSB+-tree [19]; (b) matching the data layout to the access pattern of the algorithm, i.e., storing data that are accessed together in contiguous space; or (c) reorganizing memory accesses to increase locality, e.g., with array [24] and tree blocking [25,26]. In this work, we assume that the index has the best possible implementation and locality cannot be further increased without penalizing single lookups. ...
Article
Full-text available
Index joins present a case of pointer-chasing code that causes data cache misses. In principle, we can hide these cache misses by overlapping them with computation: The lookups involved in an index join are parallel tasks whose execution can be interleaved, so that, when a cache miss occurs in one task, the processor executes independent instructions from another one. Yet, the literature provides no concrete performance model for such interleaved execution and, more importantly, production systems still waste processor cycles on cache misses because (a) hardware and compiler limitations prohibit automatic task interleaving and (b) existing techniques that involve the programmer produce unmaintainable code and are thus avoided in practice. In this paper, we address these shortcomings: we model interleaved execution explaining how to estimate the speedup of any interleaving technique, and we propose interleaving with coroutines, i.e., functions that suspend their execution for later resumption. We deploy coroutines on index joins running in SAP HANA and show that interleaving with coroutines performs like other state-of-the-art techniques, retains close resemblance to the original code, and supports both interleaved and non-interleaved execution in the same implementation. Thus, we establish the first systematic and practical approach for interleaving index joins of any type.
... We believe that the OS-based and compiler-based solutions are complementary and can be used together. Software-Based Proposals: A number of previous compiler works [12,15,26,27,32,33,38,39,49,56,60,62] explored the ways to improve data locality (cache performance). The main goal of these works is to achieve either temporal reuse or unit-stride (or small-stride) accesses for as many array references as possible so that the access pattern of data can be aligned with its memory layout. ...
Article
Going beyond a certain number of cores in modern architectures requires an on-chip network more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform. This non-uniformity can play a significant role in shaping the overall application performance. This work presents a novel compiler strategy which involves exposing architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs) and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing the on-chip network traffic. The experimental data collected using a set of 21 multi-threaded applications reveal that, on an average, our approach reduces the on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs, and 43.8% in the case of shared LLCs. These improvements translate to the corresponding execution time improvements of 10.9% and 12.7% for the private LLC and shared LLC based systems, respectively.
... It is based on using some optimization techniques like Blocking and parallelization of algorithms on multi-core systems. Blocking is a popular and well-proven optimization technique to improve the performance of matrix multiplication [7]. An overview of common optimization techniques for matrix multiplication on multi-core processor for optimized performance is discussed in [8]. ...
Article
Full-text available
Deep Neural Network training algorithms consume a long training time, especially when the number of hidden layers and nodes is large. Matrix multiplication is the key operation carried out at every node of each layer, hundreds of thousands of times, during the training of a Deep Neural Network. Blocking is a well-proven optimization technique for improving the performance of matrix multiplication. Blocked matrix multiplication algorithms can easily be parallelized to accelerate the performance further. This paper proposes a novel approach of implementing parallel blocked matrix multiplication algorithms to reduce the long training time. The proposed approach was implemented using the parallel programming model OpenMP with the collapse() clause for the multiplication of input and weight matrices of Backpropagation and Boltzmann Machine algorithms for training Deep Neural Networks, and tested on a multi-core processor system. Experimental results showed that the proposed approach achieved approximately a twofold speedup over the classic algorithms. © 2018 Institute of Advanced Engineering and Science. All rights reserved.
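A minimal sketch of the kind of OpenMP parallelization this abstract describes, combining blocking with the collapse() clause (the block size, divisibility assumption, and function name are illustrative, not the authors' code):

```c
#include <stddef.h>
#include <omp.h>

#define BS 64  /* illustrative block size */

/* Sketch: blocked matrix multiply C += A * B with the two outer block
 * loops collapsed into one parallel loop. Each (ii, jj) pair updates a
 * disjoint block of C, so the iterations are race-free. Assumes the
 * matrices are n x n, row-major, with n % BS == 0 for brevity. */
void matmul_blocked_omp(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```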
... We believe that the OS-based and compiler-based solutions are complementary and can be used together. Software-Based Proposals: A number of previous compiler works [12,15,26,27,32,33,38,39,49,56,60,62] explored the ways to improve data locality (cache performance). The main goal of these works is to achieve either temporal reuse or unit-stride (or small-stride) accesses for as many array references as possible so that the access pattern of data can be aligned with its memory layout. ...
Conference Paper
Going beyond a certain number of cores in modern architectures requires an on-chip network more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform. This non-uniformity can play a significant role in shaping the overall application performance. This work presents a novel compiler strategy which involves exposing architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs) and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing the on-chip network traffic. The experimental data collected using a set of 21 multi-threaded applications reveal that, on an average, our approach reduces the on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs, and 43.8% in the case of shared LLCs. These improvements translate to the corresponding execution time improvements of 10.9% and 12.7% for the private LLC and shared LLC based systems, respectively.
... Frigo et al. [FLP+99] discuss the cache performance of cache-oblivious algorithms for matrix transpose, FFT, and sorting. Optimizing blocked algorithms has been extensively studied [LRW91]. ...
Thesis
Full-text available
This thesis aims at developing strategies to enhance the power of sequential computation and distributed systems; in particular, it deals with sequential breakdown of operations and decentralized collaborative editing systems. In this thesis, we introduce a precision-control indexing method that generates unique identifiers which are used for indexed communication in distributed systems, particularly in decentralized collaborative editing systems. These identifiers are real numbers with a specific controlled pattern of precision. The set of identifiers is kept finite, which makes it possible to compute local as well as global cardinality. This property plays an important role in dealing with indexed communication. Besides this, some other properties, including order preservation, are observed. The indexing method was tested and verified by experimentation successfully, and it leads to the design of a decentralized collaborative editing system. Dealing with sequential breakdown of operations, we explore the limitations of existing strategies and extend the idea by introducing new strategies. These strategies lead towards optimization (processor, compiler, memory, code). This style of decomposition attracts research communities for further investigation and practical implementation that could lead towards designing an arithmetic unit.
... To fit the alignment of data in memory, the innermost loop iterates in the x-direction, while the outermost loop iterates in the z-direction. Blocking [51] is applied to increase the cache efficiency of the CPU code. Since the LBM is memory-bound, cache efficiency is of high relevance. ...
Article
Full-text available
Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way exploiting the maximal performance introduces numerous challenges such as optimizations for different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding of communication with computation. We utilize the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we come up with an implementation using all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability behavior making it future-proof for heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90 % are achieved leading to 2604.72 GLUPS utilizing 24,576 CPU cores and 2048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 × 10 9 lattice cells.
Conference Paper
We use a variety of techniques to optimize the single-threaded operation C = C + A * B. A number of optimization techniques yielded significant speedups, including multi-level blocking, copy optimizations, and loop adjustments. However, we also tried a number of other optimizations that did not improve our program, including prefetching. In this report, we describe each of the optimizations in our final submission, present results with evidence that they work, and describe attempted optimizations that did not noticeably improve our overall performance. Unless otherwise noted, all of our benchmarks operate on matrices of size 1024×1024. In our attempts at optimizing matrix multiplication, we find that taking full advantage of spatial locality (through techniques like loop rearrangement or copy optimizations) and instruction-level parallelism dramatically improves the performance of our operation. Furthermore, techniques like prefetching can be a double-edged sword: they can help performance when used correctly but can also slow it down. Finally, we find that the compiler adds optimizations out-of-the-box, like loop unrolling, that help with performance underneath.
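The copy optimization mentioned above, which also echoes the closing recommendation of the abstract at the top of this page, can be sketched as packing a reused, non-contiguous block into a contiguous buffer (a sketch assuming row-major storage; names are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a copy optimization: pack a bs x bs block of a row-major
 * n x n matrix A (row stride n) into a contiguous buffer buf, so that
 * repeated passes over the block cannot conflict-miss against itself
 * in the cache. buf must hold bs * bs doubles. */
void pack_block(size_t n, size_t bs, const double *A,
                size_t i0, size_t j0, double *buf)
{
    for (size_t i = 0; i < bs; i++)
        memcpy(&buf[i * bs], &A[(i0 + i) * n + j0], bs * sizeof(double));
}
```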
Article
Full-text available
A particle-based cloud model was developed for meter- to submeter-scale-resolution simulations of warm clouds. Simplified cloud microphysics schemes have already made meter-scale-resolution simulations feasible; however, such schemes are based on empirical assumptions, and hence they contain huge uncertainties. The super-droplet method (SDM) is a promising candidate for cloud microphysical process modeling and is a particle-based approach, making fewer assumptions for the droplet size distributions. However, meter-scale-resolution simulations using the SDM are not feasible even on existing high-end supercomputers because of high computational cost. In the present study, we overcame challenges to realize such simulations. The contributions of our work are as follows: (1) the uniform sampling method is not suitable when dealing with a large number of super-droplets (SDs). Hence, we developed a new initialization method for sampling SDs from a real droplet population. These SDs can be used for simulating spatial resolutions between meter and submeter scales. (2) We optimized the SDM algorithm to achieve high performance by reducing data movement and simplifying loop bodies using the concept of effective resolution. The optimized algorithms can be applied to a Fujitsu A64FX processor, and most of them are also effective on other many-core CPUs and possibly graphics processing units (GPUs). Warm-bubble experiments revealed that the throughput of particle calculations per second for the improved algorithms is 61.3 times faster than those for the original SDM. In the case of shallow cumulus, the simulation time when using the new SDM with 32-64 SDs per cell is shorter than that of a bin method with 32 bins and comparable to that of a two-moment bulk method. (3) Using the supercomputer Fugaku, we demonstrated that a numerical experiment with 2 m resolution and 128 SDs per cell covering a 13,824² × 3072 m³ domain is possible. The number of grid points and SDs are 104 and 442 times, respectively, those of the highest-resolution simulation performed so far. Our numerical model exhibited 98% weak scaling for 36,864 nodes, accounting for 23% of the total system. The simulation achieves 7.97 PFLOPS, 7.04% of the peak ratio for overall performance, and a simulation time for SDM of 2.86×10¹³ particle · steps per second. Several challenges, such as incorporating mixed-phase processes, inclusion of terrain, and long-time integrations, remain, and our study will also contribute to solving them. The developed model enables us to study turbulence and microphysics processes over a wide range of scales using combinations of direct numerical simulation (DNS), laboratory experiments, and field studies. We believe that our approach advances the scientific understanding of clouds and contributes to reducing the uncertainties of weather simulation and climate projection.
Article
Batched Sparse codes (BATS codes) are a class of linear network coding schemes that increase network throughput by converting a multi-hop problem into an end-to-end problem through network coding. It plays a crucial role in future wireless communication, where packet loss is inevitable. This is an enduring problem in many applications, including vehicular networks. While the theory of BATS codes has been developed over the past decade, little research has been done on its hardware implementation, which is important for practical adoption. This paper provides a systematic way to analyze the performance of a BATS hardware accelerator when the code varies. A roofline model which provides an upper bound of the performance considering both computational and input/output constraints is developed. Next, we build a model connecting the BATS code design space with the hardware execution time, which can be used to determine an optimal code. Finally, we propose and test a flexible and scalable BATS code design paradigm for hardware accelerators with numerical results. The same framework could easily be extended to other linear codes and different computing systems.
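The roofline bound referred to in this abstract is simple to state in code: attainable performance is the lesser of the peak compute rate and the product of memory bandwidth and arithmetic intensity (a generic sketch; parameter names are illustrative, not from the cited paper):

```c
/* Sketch of a roofline bound: attainable GFLOP/s is capped both by
 * the peak compute rate and by memory bandwidth times arithmetic
 * intensity. Parameter names are illustrative assumptions. */
double roofline_gflops(double peak_gflops,
                       double bandwidth_gbs, /* memory bandwidth, GB/s */
                       double intensity)     /* intensity, FLOP/byte   */
{
    double memory_bound = bandwidth_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```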
Article
The field of VLSI design faces the challenge of identifying the possible designs that simultaneously optimize multiple conflicting objectives such as area, latency, and power. Applications involving high-level synthesis (HLS) tend to engender huge design spaces due to the pragmas applied in the behavioral description. Therefore, the exhaustive search for Pareto-optimal designs in such HLS applications becomes infeasible due to the computing cost involved. To address this challenge, we propose N-PIR, a Neighborhood-based Pareto iterative refinement approach for design space exploration (DSE) that forms neighborhoods within the underlying design space by applying an active learning technique and assigns them ranks. The neighborhoods are explored in the order of their rank for predicting Pareto-optimal solutions. The key features of N-PIR include (1) neighborhood creation within the design space through an active learning strategy guided by an uncertainty threshold parameter u; (2) rank assignment to the neighborhoods based on a ranking score; and (3) exploration of neighborhoods in prioritized order (highest-ranked first) for the identification of Pareto-optimal designs. N-PIR can produce a better level of accuracy than IRF-TED and IRF-rand while spending the same number of evaluations. Also, to reach the same level of accuracy, there is a significant amount of savings (in the range of 7.84% to 46.25%) in the number of evaluations. Furthermore, our experiments show that N-PIR is on par with lattice-traversing and can outperform the cluster-based heuristic and Q-PIR within 10% and 25% of the synthesis budget, respectively.
Article
Full-text available
In many scenarios, particularly scientific AI applications, algorithm engineers widely adopt more complex convolutions, e.g. 3D CNNs, to improve accuracy. Scientific AI applications with 3D CNNs, which tend to train on volumetric datasets, substantially increase the size of the input, which in turn potentially restricts the channel sizes (e.g. to less than 64) under the constraints of limited device memory capacity. Since existing convolution implementations tend to split and parallelize computation of the small-channel convolution along the channel dimension, they usually cannot fully exploit the performance of the GPU accelerator, in particular one configured with the emerging tensor cores. In this work, we target enhancing the performance of small-channel 3D convolution on GPU platforms configured with tensor cores. Our analysis shows that the channel size of a convolution has a great effect on the performance of existing convolution implementations, which are memory-bound on tensor cores. By leveraging the memory hierarchy characteristics and the WMMA API of the tensor core, we propose and implement holistic optimizations for both promoting data access efficiency and intensifying the utilization of computing units. Experiments show that our implementation can obtain 1.1x-5.4x speedups compared to cuDNN's implementations of the 3D convolutions on different GPU platforms. We also evaluate our implementations on two practical scientific AI applications and observe up to 1.7x and 2.0x overall speedups compared with using cuDNN on a V100 GPU.
Chapter
This chapter considers the process of automatic program optimization and parallelization. The problem is important for computation speed, resource usage, and power consumption (which depends on the activity of system elements), particularly for mobile, embedded, and standalone systems, especially those based on multicore microprocessors. Huge potential for parallelization is concentrated in the cyclic sections of algorithms, since the main part of the computation is performed there. In this chapter the authors consider a way of automatically parallelizing loops that uses an iterative process to find an optimal solution. The Particle Swarm Optimization (PSO) method is used as the core of this process: it gives a solution close to the optimal one and, moreover, reduces the development time of an optimized parallel program. The authors show how to partition the iteration space of program loops by applying the PSO method in a cyclic process; this method is named the Smart Tiling Method. Experiments show the positive effect of the Smart Tiling Method on the execution time of the parallelized program, as illustrated in the sketch below.
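A minimal C sketch of how PSO can drive tile-size selection, in the spirit of the chapter's Smart Tiling Method; the cost() function is a synthetic stand-in (penalizing tiles that overflow a hypothetical 32 KB cache and tiles too small to amortize loop overhead), not the chapter's actual objective.

```c
#include <stdio.h>
#include <stdlib.h>

#define SWARM 8
#define ITERS 40

/* Synthetic cost of a t x t tile: cache-overflow penalty + overhead term. */
static double cost(double t) {
    double bytes = 3.0 * t * t * 8.0;   /* three t x t tiles of doubles */
    double overflow = bytes > 32768.0 ? (bytes - 32768.0) / 1024.0 : 0.0;
    return overflow + 100.0 / t;
}

int main(void) {
    double x[SWARM], v[SWARM], pbest[SWARM], pcost[SWARM];
    double gbest = 0.0, gcost = 1e30;
    srand(1);
    for (int i = 0; i < SWARM; i++) {           /* random initial tile sizes */
        x[i] = 4 + rand() % 125;
        v[i] = 0.0;
        pbest[i] = x[i];
        pcost[i] = cost(x[i]);
        if (pcost[i] < gcost) { gcost = pcost[i]; gbest = x[i]; }
    }
    for (int it = 0; it < ITERS; it++)
        for (int i = 0; i < SWARM; i++) {
            double r1 = rand() / (double)RAND_MAX;
            double r2 = rand() / (double)RAND_MAX;
            /* Standard PSO velocity update: inertia + cognitive + social. */
            v[i] = 0.7 * v[i] + 1.4 * r1 * (pbest[i] - x[i])
                              + 1.4 * r2 * (gbest - x[i]);
            x[i] += v[i];
            if (x[i] < 4.0)   x[i] = 4.0;       /* keep tile size in range */
            if (x[i] > 128.0) x[i] = 128.0;
            double c = cost(x[i]);
            if (c < pcost[i]) { pcost[i] = c; pbest[i] = x[i]; }
            if (c < gcost)    { gcost = c;    gbest = x[i]; }
        }
    printf("chosen tile size ~ %.0f\n", gbest);
    return 0;
}
```

In a real deployment the fitness would be the measured execution time of the tiled loop nest rather than an analytic penalty.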
Book
Full-text available
Poised for smart-citizenry engagement, an unprecedented deluge of high-quality streaming services creates a major data traffic challenge for Fourth Generation (4G) bandwidth and coverage, and the expectations of the upcoming smart city cannot be ignored. The bottlenecks in exploiting benefits such as e-Life at the Mobile Equipment Provider (MEP) tier are complemented by the Device Dependent Device Independent (D3I) configurations inherent at the various tiers of Mobile Service Providers (MSP). Enabling Supply Chain Management (SCM) augments a unique system of support involving the MSPs and MEPs for Customer Relationship Management (CRM) and Ad hoc Resource Planning (ARP), which we found prevalent in the migration scenario from 4G to 5G technology deployment. Despite its complexity, both in terms of one-to-many and many-to-one relationships across diverse MSP and MEP options, meeting the SCM operational objectives sets forth a unique challenge, and it is the main objective of the work presented here. In this book, we present a framework to enhance the legacy 4G mobile service provider capacity for a smart city Machine-to-Machine (M2M) backbone. The migration process is assessed with proposed strategic, technical, and operational indicators, which demonstrate its adaptability and flexibility when integrated into conventional 4G deployments, especially with respect to radio devices and applications. Web-enabled Software Defined Radio devices and applications are used to index the migration cost and to support SCM planning and execution. We identify that decentralization of the mobile service providers' infrastructure plays a major role in reducing the embedded complexity, which often appears as the primary bottleneck. With the MSP as a key player in the elasticity of migration, we present a platform to support both large and small MEP-MSP co-deployments. Pareto multi-criteria optimization is used to find the strategic indicators, the primary Transformation Steering Factors (TSF), valid in both device-dependent and device-independent M2M migration. We present our results for achieving the TSF while rolling out interoperability and reconfiguration of devices deployed across typically volatile inter-MSP and intra-MSP tiers. Pareto Migration Indicators (MI) are optimized successively across the transformation schemes, relative to baseline MSP services, enabling a cost-effective choice while the elasticity of provider-centric cost depends adaptively on the technology legacy and the M2M access of User Equipment (UE).
Chapter
Stencil computations are a class of computations commonly found in scientific and engineering applications. They have relatively low arithmetic intensity, so their performance is greatly affected by memory access. This paper studies memory access optimization for the key stencil computations of a high-order CFD program on an NVIDIA GPU. Two methods are used to optimize performance. First, we use registers to cache the data used by the stencil computations in the kernel: we use the CUDA warp shuffle functions to exchange data between neighboring grid points and adjust the thread computation granularity to increase data reuse. Second, we use shared memory to buffer the grid data used by the stencil computations in the kernel, and apply loop tiling to reduce redundant accesses to global memory. Performance evaluation is done on an NVIDIA Tesla K80 GPU. The results show that, compared to the original implementation that uses only global memory, the register-based implementation achieves maximum speedups of 2.59 and 2.79 for 15M and 60M grids, respectively, and the shared-memory implementation achieves maximum speedups of 3.51 and 3.36 for 15M and 60M grids, respectively.
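The loop-tiling half of the optimization can be sketched with a CPU analogue in C (ours, not the paper's CUDA kernels); on the GPU, the tile would be staged in shared memory rather than relying on the cache.

```c
#define TILE 64  /* tile edge: a tunable assumption */

/* Tiled 5-point stencil on an n x n grid stored row-major: each TILE x TILE
   block of the grid is swept while it is resident in fast memory, so the
   four neighbor loads mostly hit. Boundary rows/columns are left untouched. */
void stencil_tiled(int n, const double *in, double *out) {
    for (int ii = 1; ii + 1 < n; ii += TILE)
        for (int jj = 1; jj + 1 < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i + 1 < n; i++)
                for (int j = jj; j < jj + TILE && j + 1 < n; j++)
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j]
                                           + in[(i + 1) * n + j]
                                           + in[i * n + j - 1]
                                           + in[i * n + j + 1]);
}
```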
Chapter
Full-text available
Embedded systems have to be efficient (at least) with respect to the objectives considered in this book. In particular, this applies to resource-constrained mobile systems, including sensor networks embedded in the Internet of Things. In order to achieve this goal, many optimizations have been developed. Only a small subset of those can be mentioned in this book. In this chapter, we will present a selected set of such optimizations. This chapter is structured as follows: first of all, we will present some high-level optimization techniques, which could precede compilation of source code or could be integrated into it. We will then describe concurrency management for tasks. Section 7.3 comprises advanced compilation techniques. The final Sect. 7.4 introduces power and thermal management techniques.
Preprint
Full-text available
Large scale modeling and simulation problems, from nanoscale materials to universe-scale cosmology, have in the past used the massive computing resources of High-Performance Computing (HPC) systems. Over the last decade, cloud computing has gained popularity for business applications and increasingly for computationally intensive machine learning problems. Despite the prolific literature, the question remains open whether cloud computing can provide HPC-competitive performance for a wide range of scientific applications. The answer to this question is crucial in guiding the design of future systems and providing access to high-performance resources to a broadened community. Here we present a multi-level approach to identifying the performance gap between HPC and cloud computing and to isolating several variables that contribute to this gap, dividing our experiments into (i) hardware and system microbenchmarks and (ii) user applications. Our results show that today's high-end cloud computing can deliver HPC-like performance - at least at modest scales - not only for computationally intensive applications, but also for memory- and communication-intensive applications, thanks to the high-speed memory systems and interconnects and dedicated batch scheduling now available on some cloud platforms.
Chapter
Cloud computing is emerging as a platform for scalable and efficient services. In order to handle massive data and decrease data migration, the computation infrastructure requires efficient data placement and proper management of cached data. We propose an efficient and cost-effective multilevel caching scheme, called MERCURY, as the computation infrastructure of the cloud. The idea behind MERCURY is to explore and exploit data similarity and support efficient data placement. To capture data similarity accurately and efficiently, we leverage low-complexity Locality-Sensitive Hashing (LSH). In our design, in addition to the problem of space inefficiency, we identify that a conventional LSH scheme also suffers from the problem of homogeneous data placement. To address these two problems, we design a novel Multicore-enabled LSH (MC-LSH) that accurately captures the differentiated similarity across data. The similarity-aware MERCURY then partitions data into the L1 cache, the L2 cache, and main memory based on their distinct localities, which helps optimize cache utilization and minimize pollution in the last-level cache. Besides extensive evaluation through simulations, we also implemented MERCURY in a real system. Experimental results based on real-world applications and datasets demonstrate the efficiency and efficacy of our proposed schemes (© 2014 IEEE. Reprinted, with permission, from Ref. [1].).
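The LSH building block can be illustrated with a minimal sign (random-projection) hash in C; this is a generic sketch of the idea that similar vectors collide with high probability, not MC-LSH itself.

```c
#include <stdlib.h>

#define DIM  16   /* vector dimensionality (illustrative) */
#define BITS 8    /* hash width: one random hyperplane per bit */

static double planes[BITS][DIM];

void lsh_init(unsigned seed) {
    srand(seed);
    for (int b = 0; b < BITS; b++)
        for (int d = 0; d < DIM; d++)
            planes[b][d] = rand() / (double)RAND_MAX - 0.5;
}

/* Vectors with a small angle between them agree on most sign bits, so they
   tend to land in the same bucket, i.e., the same cache partition. */
unsigned lsh_hash(const double *v) {
    unsigned h = 0;
    for (int b = 0; b < BITS; b++) {
        double dot = 0.0;
        for (int d = 0; d < DIM; d++)
            dot += planes[b][d] * v[d];
        if (dot > 0.0)
            h |= 1u << b;
    }
    return h;
}
```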
Article
We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. Our techniques may be used for systematic exploitation of the memory hierarchy at the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.
Article
The optimization and parallelization of an industrial Baltic Sea water dynamics modeling program is described. The program is based on solving the system of partial differential equations of shallow water with numerical methods. A systematic approach to program modernization is demonstrated, involving building a module dependency graph and rewriting every module in a specific order. To achieve the desired speedup, the program is translated into another language and several key optimization methods are used, including parallelization of the most time-consuming loop nests. The theory of optimizing and parallelizing program transformations is used to achieve the best performance boost for a given amount of work. The list of applied program transformations is presented, along with the achieved speedup for the most time-consuming subroutines. Speedup results for the entire program on a shared-memory computer system are presented.
Chapter
Full-text available
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPU research toward establishing principles and strategies that allow efficient mapping of computation to the hardware. This work presents the organization and features of the GeForce GTX 560 Ti processor and generalized optimization strategies for it. Performance on the platform is achieved by using massive multithreading across a large number of cores to hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. Increased performance is also obtained by reordering accesses to off-chip memory and combining requests for the same or contiguous memory locations, and by applying classical optimizations that reduce the number of executed operations. These strategies are applied across a variety of applications and domains and achieve between a 10.5X and 14X application speedup. A similar result was achieved with a single-core GPU using a deep learning technique in the TensorFlow framework.
Article
Full-text available
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-vector operations that should provide for efficient and portable implementations of algorithms for high-performance computers.
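A minimal C sketch of the operation class such matrix-vector extensions cover, y ← alpha·A·x + beta·y for a row-major m×n matrix (illustrative only, not the reference BLAS code):

```c
/* GEMV-like kernel: each row of A is swept with stride 1, and x is reused
   across all m rows. Row-major storage is an assumption of this sketch. */
void gemv_like(int m, int n, double alpha, const double *A,
               const double *x, double beta, double *y) {
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int j = 0; j < n; j++)
            t += A[i * n + j] * x[j];
        y[i] = alpha * t + beta * y[i];
    }
}
```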
Conference Paper
Full-text available
Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. In this paper, we present a source-to-source transformation, called scalar replacement , that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective—capable of achieving integer factor speedups over code generated by good optimizing compilers of conventional design.
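A minimal C illustration of the transformation (loop and array names are ours, not the paper's):

```c
/* Before scalar replacement: a[i] and a[i-1] are subscripted references in
   the inner loop, so a conventional allocator leaves them in memory. */
void before(double *a, const double *b, int n, int m) {
    for (int i = 1; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i] += b[j] * a[i - 1];
}

/* After scalar replacement: the inner loop references only the scalars
   t and s, which a coloring-based allocator can keep in registers. */
void after(double *a, const double *b, int n, int m) {
    for (int i = 1; i < n; i++) {
        double t = a[i];
        double s = a[i - 1];
        for (int j = 0; j < m; j++)
            t += b[j] * s;
        a[i] = t;
    }
}
```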
Article
Full-text available
In this paper the authors describe a method for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes. The estimates take the form of a family of reference windows for each variable that reflects the current set of elements that should be kept in cache. It is shown that, in important special cases, the authors can estimate the size of the window and predict a lower bound on the number of cache hits. If the machine has local memory or cache that can be managed by the compiler, these estimates can be used to guide the management of this resource. It is also shown that these estimates can be used to guide program transformation in an attempt to optimize cache performance.
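A small C example of the reference-window idea (the loop is ours, not from the paper):

```c
/* The flow dependence from the write of a[i] to the later read of a[i-3]
   implies a reference window of the 3 most recently written elements of a;
   any cache or compiler-managed local memory that retains those 3 values
   guarantees the read of a[i-3] hits. */
void window_example(double *a, int n) {
    for (int i = 3; i < n; i++)
        a[i] = a[i] + a[i - 3];
}
```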
Thesis
Full-text available
Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event-tracing facility that can simulate the data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used; programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Data cache hit ratios varied on average between 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets generate poor cache performance even with large caches. The hit ratios of individual references are measured to be close to either 0% or 100%. By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations that can execute without filling the cache, the overflow iteration. The overflow iteration, combined with the dependence graph, can be used to determine at each reference whether execution will result in hits or misses. Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often perform this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked, provided the components of the dependence vectors are bounded.
Conference Paper
In this paper, the red-blue pebble game is proposed to model the input-output complexity of algorithms. Using the pebble game formulation, a number of lower bound results for the I/O requirement are proven. For example, it is shown that to perform the n-point FFT or the ordinary n×n matrix multiplication algorithm with O(S) memory, at least Ω(n log n / log S) or Ω(n³/√S) time, respectively, is needed for the I/O. Similar results are obtained for algorithms for several other problems. All of the lower bounds presented are the best possible in the sense that they are achievable by certain decomposition schemes. Results of this paper may provide insight into the difficult task of balancing I/O and computation in special-purpose system designs. For example, for the n-point FFT, the lower bound on I/O time implies that an S-point device achieving a speed-up ratio of order log S over the conventional O(n log n) time implementation is all one can hope for.
Article
Linear algebra algorithms based on the BLAS or extended BLAS do not achieve high performance on multivector processors with a hierarchical memory system because of a lack of data locality. For such machines, block linear algebra algorithms must be implemented in terms of matrix-matrix primitives (BLAS3). Designing efficient linear algebra algorithms for these architectures requires analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters. The analysis must identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration. We propose a methodology that facilitates such an analysis and use it to analyze the performance of the BLAS3 primitives used in block methods. A similar analysis of the block size-performance relationship is also performed at the algorithm level for block versions of the LU decomposition and the Gram-Schmidt orthogonalization procedures.
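A minimal C sketch of the kind of blocking a BLAS3-style primitive relies on; the block size NB is a tunable, machine-dependent assumption.

```c
#define NB 32  /* block size: illustrative, to be tuned per machine */

/* C += A * B on n x n row-major matrices, blocked so that each NB x NB
   submatrix is reused while it remains resident in cache. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i * n + k];   /* scalar reused across j */
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```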
Article
Matrix representations and operations are examined for the purpose of minimizing the page faulting occurring in a paged memory system. It is shown that carefully designed matrix algorithms can lead to enormous savings in the number of page faults occurring when only a small part of the total matrix can be in main memory at one time. Examination of addition, multiplication, and inversion algorithms shows that a partitioned matrix representation (i.e. one submatrix or partition per page) in most cases induces fewer page faults than a row-by-row representation. The number of page-pulls required by these matrix manipulation algorithms is also studied as a function of the number of pages of main memory available to the algorithm.
Article
This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations. The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.
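The effect of one such transformation, loop interchange, can be sketched in C (example ours):

```c
/* A is an n x n row-major matrix; the inner loop below strides by n
   doubles per iteration, touching a new cache line almost every access. */
void sum_poor_locality(int n, const double *A, double *s) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            *s += A[i * n + j];
}

/* After interchange the inner loop is stride-1, so successive accesses
   fall in the same cache line. */
void sum_after_interchange(int n, const double *A, double *s) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            *s += A[i * n + j];
}
```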