Fig. 5
Memory hierarchy. Typical latencies for data transfers from the CPU to each of the levels are shown. These numbers are indicative only; the actual values depend on the exact architecture under consideration and the access sequence of the program.


Source publication
Conference Paper
Full-text available
The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a se...

Context in source publication

Context 1
... hierarchy. Most computer systems use a memory hierarchy to bridge the speed gap between the processor(s) and its connection to main memory. As shown in Fig. 5, the highest levels of the memory hierarchy contain the fastest and the smallest memory systems, and vice ...
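The latency gap that Fig. 5 depicts can be made visible with a small microbenchmark. Below is a minimal C sketch (not taken from the tutorial) that chases pointers through a random single-cycle permutation, so every load depends on the previous one and prefetching cannot hide the latency; as the working set outgrows each cache level, the measured time per access jumps.

/* Sketch: average memory access latency via dependent pointer chasing. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (size_t n = (size_t)1 << 10; n <= (size_t)1 << 24; n <<= 2) {
        size_t *chain = malloc(n * sizeof *chain);
        if (!chain) return 1;
        for (size_t i = 0; i < n; i++) chain[i] = i;
        /* Sattolo's algorithm: a random single-cycle permutation, so the
         * chase visits every element in one long dependency chain. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
        }
        size_t idx = 0;
        const long iters = 5 * 1000 * 1000;
        clock_t t0 = clock();
        for (long k = 0; k < iters; k++)
            idx = chain[idx];                   /* each load depends on the last */
        double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / (double)iters;
        printf("working set %8zu KiB: ~%6.1f ns/access (idx=%zu)\n",
               n * sizeof *chain / 1024, ns, idx);
        free(chain);
    }
    return 0;
}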

Citations

... Moreover, the computational time t has been evaluated. However, this last datum has to be considered as only indicative of the order of magnitude rather than the exact time involved in each considered retrieval procedure, since such knowledge would require the development of numerical codes optimized for a given CPU in order to achieve the best possible computational performance [41]. Figure 1 shows the path γ(ω) traced by Equation (8) for the first slab, with d_eff = 180 nm, which is characterized by a single-pole Lorentzian model for the pair (ε_r(ω), µ_r(ω)). ...
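For orientation, a single-pole Lorentzian permittivity model of the kind the excerpt refers to typically takes the form below; the symbols ε_∞ (high-frequency limit), ω_p (oscillator strength), ω_0 (resonance frequency) and γ (damping) are generic textbook parameters, not values taken from the cited paper:

ε_r(ω) = ε_∞ + ω_p² / (ω_0² − ω² − iγω)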
Article
Full-text available
The characterization of electromagnetic metamaterials (MMs) plays a fundamental role in their engineering processes. To this end, the Nicolson–Ross–Weir (NRW) method is intensively used to recover the effective parameters of MMs, even though it is affected by the branch ambiguity problem. In this paper, we address this issue in the context of global analytic functions and Riemann surfaces. This point of view allows us to rigorously demonstrate the mathematical foundations of an algorithmic approach for avoiding the branch ambiguity problem, in which the phase unwrapping method is merged with K-K relations for recovering the effective parameters of an MM. In addition, exploiting the intimate relationship between the K-K relations and the Hilbert transform, a simple variant of the above algorithm is presented.
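As background, the Kramers–Kronig (K-K) relations mentioned here tie the real and imaginary parts of a causal response function χ(ω) to one another; a standard textbook form (P denotes the Cauchy principal value, not necessarily the exact variant used in the paper) is:

Re χ(ω) = (2/π) P ∫₀^∞ ω′ Im χ(ω′) / (ω′² − ω²) dω′

It is this causality constraint that the paper's algorithm merges with phase unwrapping to resolve the branch ambiguity.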
... There are many books and papers on efficient scientific software development; "Writing scientific software: a guide for good style" [105] and "The Software Optimization Cookbook" [106] are just two out of many examples. A brief introduction, "How to Write Fast Numerical Code" [107], is a good way to refresh one's memory on particular techniques for efficient coding. One can find short summaries on software engineering best practices in [108][109][110]. ...
Article
Using our decades-long experience in radiative transfer (RT) code development for Earth science, we endeavor to reduce the knowledge gap of bringing RT from theory to code quickly. Despite numerous classic and recent literature, it is still hard to develop an RT code from scratch within a few weeks. It is equally hard to understand, not to mention modify, an existing “monster” RT code, for which the developer is either located remotely or has retired. Following the format of “Numerical Recipes” by Press et al., we collocate in this paper small pieces of necessary theory with corresponding small pieces of RT code. These are arranged in an order that is natural for code development, which is often the opposite of the natural order for laying out the theoretical basis. We focus on the transfer of unpolarized monochromatic solar radiation in a plane-parallel atmosphere over a reflecting surface. Both the surface and the atmosphere are homogeneous (uniform) in all directions. The multiple scattering is numerically solved using the deterministic method of Gauss-Seidel iterations. Except for the presented Python-Numba open-source RT code gsit, the paper does not report any new scientific results, but rather serves as an academic demonstration. If development time is an issue or the reader is familiar with basic concepts of RT theory, we recommend proceeding directly to Sec. 3, “RT code development”.
Program summary
Program title: gsit (pronounced “jeezit”)
CPC Library link to program files: https://doi.org/10.17632/d3zt5zhx49.1
Developer's repository link: https://github.com/korkins/gsit
Licensing provisions: MIT
Programming language: Python 3
Nature of problem: We present a tutorial in Python code for deterministic (non-stochastic) numerical simulation of multiple scattering of monochromatic solar light in a plane-parallel Earth atmosphere bounded from below by a reflecting surface. The problem is solved in a simplified form (i.e., uniform atmosphere, no polarization, uniform surface reflectance, etc.) to better explain numerical features, rather than physics, of propagation of light in the atmosphere.
Solution method: The method of Gauss-Seidel iterations. It relies on the Fourier decomposition of the Radiative Transfer Equation over azimuth, Gauss quadrature for numerical integration over the zenith, and an iterative process for integration over height (optical depth), with the analytical (hence known) single scattering approximation being the starting point. The method is relatively simple to code and does not require any external libraries.
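The Gauss-Seidel principle named above — sweep through the unknowns, immediately reusing each freshly updated value within the same sweep — is easy to show on a generic linear system. A minimal C sketch follows (a toy diagonally dominant system, not the gsit RT solver, which applies the same iteration to the discretized RT equation):

/* Generic Gauss-Seidel iteration for A x = b (illustrative only). */
#include <math.h>
#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{4, -1, 0}, {-1, 4, -1}, {0, -1, 4}};  /* diagonally dominant */
    double b[N] = {15, 10, 10};
    double x[N] = {0, 0, 0};

    for (int sweep = 0; sweep < 100; sweep++) {
        double maxdiff = 0.0;
        for (int i = 0; i < N; i++) {
            double s = b[i];
            for (int j = 0; j < N; j++)
                if (j != i) s -= A[i][j] * x[j];  /* x[j] for j < i is already updated */
            double xnew = s / A[i][i];
            maxdiff = fmax(maxdiff, fabs(xnew - x[i]));
            x[i] = xnew;                          /* update in place */
        }
        if (maxdiff < 1e-12) break;               /* converged */
    }
    printf("x = (%.6f, %.6f, %.6f)\n", x[0], x[1], x[2]);
    return 0;
}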
... We further optimized our code using single-core SIMD vectorization with the AVX2 standard. Rather than attempt to utilize esoteric instructions (e.g., _mm_permute_ps), we follow a straightforward best-practice guideline [Pü11], which advises rearranging the input data layout to accommodate trivial vectorization of each floating point operator in the scalar code. For example, the scalar product a = b * c becomes the 8-wide vectorized product a = _mm256_mul_ps(b,c). ...
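A minimal C sketch of the practice the excerpt describes (function names are illustrative): once the data sits in a structure-of-arrays layout, each scalar operation maps one-to-one onto an 8-wide AVX2 intrinsic. Compile with -mavx2.

/* Trivial vectorization of an elementwise product with AVX2. */
#include <immintrin.h>
#include <stdio.h>

void mul_scalar(const float *b, const float *c, float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];                 /* one float at a time */
}

void mul_avx2(const float *b, const float *c, float *a, int n) {
    for (int i = 0; i < n; i += 8) {        /* assumes n % 8 == 0 */
        __m256 vb = _mm256_loadu_ps(b + i); /* load 8 floats */
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 va = _mm256_mul_ps(vb, vc);  /* 8 multiplies at once */
        _mm256_storeu_ps(a + i, va);
    }
}

int main(void) {
    float b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float c[8] = {2, 2, 2, 2, 2, 2, 2, 2};
    float a[8];
    mul_avx2(b, c, a, 8);
    printf("a[0]=%g a[7]=%g\n", a[0], a[7]);
    return 0;
}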
Article
Full-text available
Across computer graphics, vision, robotics and simulation, many applications rely on determining the 3D rotation that aligns two objects or sets of points. The standard solution is to use singular value decomposition (SVD), where the optimal rotation is recovered as the product of the singular vectors. Faster computation of only the rotation is possible using suitable parameterizations of the rotations and iterative optimization. We propose such a method based on the Cayley transformations. The resulting optimization problem allows a better local quadratic approximation compared to the Taylor approximation of the exponential map. This results in both faster convergence and a more stable approximation compared to other iterative approaches. It also maps well to AVX vectorization. We compare our implementation with a wide range of alternatives on real and synthetic data. The results demonstrate up to two orders of magnitude of speedup compared to a straightforward SVD implementation and a 1.5-6 times speedup over popular optimized code.
... Therefore, programmers always aim to reduce and bound them. For instance, [19] provides general techniques, demonstrated on matrix-matrix multiplication and the discrete Fourier transform (DFT), for improving any code containing numerical evaluations, with an emphasis on optimisations for the computer's memory hierarchy so that the code fits the platforms of modern processors. ...
Conference Paper
Full-text available
We present an optimised software (GEOWARE) for determination of high-frequency geoid height using terrestrial gravity measurements. The optimisation of the Stokes integral is based on the extraction of a local area with a radius of a few hundred kilometres around the computation point, which complies with the specified spherical cap sizes. The extraction step is highly important because it detaches the dispensable compartments of the grid which are far from the computation domain. That makes it possible to avoid passing through the compartments of the entire grid to test whether the spherical distances comply with the truncated cap size or not. Matlab relational operators and vectorisation are powerful optimisation tools because they can replace conditional statements and nested loops efficiently. GEOWARE has been compared with a non-optimised code over different cap sizes and shows a significant improvement in performance. The run time of GEOWARE for all cap sizes is up to 5 times smaller than that of the code before optimisation. GEOWARE is also compatible with modified Stokes, Newton and Poisson kernels.
... A quick Google search yielded numerous lecture notes and/or homework exercises that utilize this operation [11], [12]. What these materials have in common is that they cite a number of insightful papers [13], [14], [15], [16], [17], [18]. We ourselves created the "how-to-optimize-gemm" wiki [19] and a sandbox that we call BLISlab [20] that build upon our BLAS-like Library Instantiation Software [21], [22] refactoring of the GotoBLAS approach [13] to implementing MMM. ...
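For orientation, the core idea behind the cited GotoBLAS-style approaches is cache blocking of the three MMM loops. A bare-bones C sketch follows; the block size NB is an illustrative tuning parameter, and real GotoBLAS/BLIS kernels add data packing, register blocking and SIMD micro-kernels on top of this.

/* Cache-blocked C += A*B for square n x n row-major matrices. */
#define NB 64   /* block size: an illustrative tuning parameter */

void gemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* Multiply one NB x NB block; operands stay in cache. */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}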
... Recent mainstream commodity CPUs enable us to build inexpensive computing systems with computational power similar to that of supercomputers just ten years ago. However, these advances in hardware performance result from the increasing complexity of the computer architecture, and they actually increase the difficulty of fully utilizing the available computational power for a specific application [4]. This paper focuses on fully utilizing the computing power of modern CPUs by code optimization and parallelization for specific hardware, enabling the real-time complete ACCC application for practical power grids on commodity computing systems. ...
Article
Full-text available
Multi-core CPUs with multiple levels of parallelism (i.e. data level, instruction level and task/core level) have become the mainstream CPUs for commodity computing systems. Based on the multi-core CPUs, in this paper we develop a high performance computing framework for AC contingency calculation (ACCC) to fully utilize the computing power of commodity systems for online and real-time applications. Using a Woodbury matrix identity based compensation method, we transform and pack multiple contingency cases of different outages into a fine-grained vectorized data parallel programming model. We implement the data parallel programming model using the SIMD instruction extension on x86 CPUs, thereby taking full advantage of the CPU cores' SIMD floating point capability. We also implement a thread pool scheduler for ACCC on multi-core CPUs which automatically balances the computing load across CPU cores to fully utilize the multi-core capability. We test the ACCC solver on the IEEE test systems and on the Polish 3000-bus system using a quad-core Intel Sandy Bridge CPU. The optimized ACCC solver achieves close to linear speedup (SIMD width multiplied by core count) compared to the scalar implementation and is able to solve a complete N-1 line outage AC contingency calculation of the Polish grid within one second on a commodity CPU. It enables the complete ACCC as a real-time application on commodity computing systems.
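For reference, the Woodbury matrix identity behind such compensation methods is the following: an outage modifies the system matrix A by a low-rank term UCV, so the factorization of the base-case A can be reused instead of refactorizing for each contingency.

(A + UCV)⁻¹ = A⁻¹ − A⁻¹ U (C⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹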
... Exact dimensions are favorable when writing code constructs that allow performance-specific programming. Specifically, as shown in [10], one can adjust the code to the size of the available memory given by the hardware. For instance, by applying blocking, loop merging, scheduling, and buffering techniques, the compiled code can execute much faster. ...
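Of the techniques listed, loop merging (fusion) is the simplest to show. A generic C sketch, not taken from [10]: two traversals of the same array are combined into one, so each element is loaded from memory once instead of twice.

/* Loop merging (fusion): one pass over x instead of two. */
void separate(int n, const double *x, double *y, double *z) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];   /* first traversal  */
    for (int i = 0; i < n; i++) z[i] = x[i] + 1.0;   /* second traversal */
}

void fused(int n, const double *x, double *y, double *z) {
    for (int i = 0; i < n; i++) {   /* x[i] is loaded from memory once */
        y[i] = 2.0 * x[i];
        z[i] = x[i] + 1.0;
    }
}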
... With these pieces of information one can tailor the code to the given hardware and extract the maximum performance from it. This is so-called parameter-based performance tuning, and according to [10] there are three levels of hardware-dependent optimization: ...
Article
The paper presents a review of active set (AS) algorithms that have been deployed for the implementation of fast model predictive control (MPC). The main purpose of the survey is to identify the dominant features of the algorithms that contribute to fast execution of online MPC and to study their influence on the speed. The simulation study is conducted on two benchmark examples where the algorithms are analyzed in terms of the number of iterations and the workload per iteration. The obtained results suggest directions for potential improvement in the speed of existing AS algorithms. Copyright © 2014 John Wiley & Sons, Ltd.
... There are different ways (and levels) to represent the organization of a computer. However, in recent years the perception is that this trend is languishing and that the era of free speedup for legacy code is coming to an end [26]. Since 2005, CPU frequencies have stalled at around 2 to 3 GHz, and many factors are limiting the growth of achievable single-core performance. The first important factor is the non-linear growth of power consumption as the clock rate increases, which at the levels computers have now reached is completely unacceptable (at 130 W, air cooling systems are no longer practical). Clock rates stopped climbing because the growth in power had to be arrested, and because of the emergence of mobile and embedded computing, where the need for low power consumption is much more important than it is for desktops and servers. ...
... 26 shows how each of these terms contributes to detecting the most plausible configurations. ...
Thesis
Full-text available
This thesis proposes a computer vision system for detecting and tracking multiple targets in videos. The covariance matching method is the guiding thread of our work because it offers a compact representation of the target by embedding heterogeneous features in an elegant way. Therefore, it is efficient both for tracking and recognition. Four categories of contributions are proposed. The first one deals with adaptation to a changing context, following two aspects. A preliminary work consists in adapting color according to lighting variations and the relevance of the color. Then, the literature shows a wide variety of tracking methods, which have both advantages and limitations, depending on the object to track and the context. Here, a deterministic method is developed to automatically adapt the tracking method to the context through the cooperation of two complementary techniques. A first proposition combines covariance matching, which models texture-color information, with optical flow (KLT) over a set of points uniformly distributed on the object. A second technique associates covariance and Mean-Shift. In both cases, the cooperation allows a good robustness of the tracking whatever the nature of the target, while reducing the global execution times. The second contribution is the definition of descriptors that are both discriminative and compact to be included in the target representation. To improve the visual recognition ability of the descriptors, two approaches are proposed. The first is an adaptation of Local Binary Pattern (LBP) operators for inclusion in the covariance matrices; this method is called ELBCM, for Enhanced Local Binary Covariance Matrices. The second approach is based on the analysis of different color spaces and invariants to obtain a descriptor which is discriminating and robust to illumination changes. The various experiments in tracking and recognition (texture, faces, pedestrians) show very promising results. The third contribution addresses the problem of multi-target tracking, the difficulties of which are the matching ambiguities, the occlusions, and the merging and division of trajectories. We also propose the re-identification of targets using a set of spatially adapted covariance descriptors and the minimization of a discrete energy function that takes into account the kinematic behavior of the objects and models their appearance. Finally, to speed up the algorithms and provide a quick, usable solution for embedded applications, this thesis proposes a series of optimizations to accelerate the matching using covariance matrices. Data layout transformations, vectorization of the calculations (using SIMD instructions) and some loop transformations have made possible the real-time execution of the algorithm not only on classic Intel platforms but also on embedded platforms (ARM Cortex A9 and Intel U9300).
... Out-of-order execution re-orders instructions according to their dependencies, and independent instructions within an instruction dispatch window can be executed simultaneously on multiple functional units. Code optimization techniques such as loop unrolling, mixing independent instructions, using bigger un-branched code blocks, etc. can be used to exploit the instruction-level parallelism on superscalar hardware architectures [4] [6]. 2. Data level: Single Instruction Multiple Data (SIMD): The Streaming SIMD Extensions (SSE) or the Advanced Vector eXtensions (AVX) instruction sets on Intel or AMD's x86 CPUs can perform floating point arithmetic operations on 4 (SSE) or 8 (AVX) single-precision floating point values packed in a vector register at the same time. ...
... In most sparse solvers, the traversal over the sparse matrix is guided by nested loops; for example, the upper part of Fig. 13 shows the traversal of a sparse matrix in compressed column storage format. The nested loops with only a few operations in the body result in unpredictable branches, limit out-of-order execution and instruction reordering, and hamper efficient register allocation and instruction scheduling [4]. In order to optimize the performance of the sparse solver at the instruction level, we employ aggressive loop unrolling to combine consecutive columns into bigger non-looping, non-branching code blocks. ...
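A simplified C sketch of the kind of transformation described (a compressed-column sparse matrix-vector product with two columns combined per outer iteration; illustrative only, not the authors' solver code):

/* y += A*x with A in compressed column storage (colptr, rowind, val). */
void spmv_csc(int n, const int *colptr, const int *rowind,
              const double *val, const double *x, double *y) {
    for (int j = 0; j < n; j++)                   /* one column per iteration */
        for (int p = colptr[j]; p < colptr[j + 1]; p++)
            y[rowind[p]] += val[p] * x[j];
}

void spmv_csc_unroll2(int n, const int *colptr, const int *rowind,
                      const double *val, const double *x, double *y) {
    int j = 0;
    for (; j + 1 < n; j += 2) {                   /* two columns per iteration:
                                                     a bigger straight-line body */
        double xj0 = x[j], xj1 = x[j + 1];
        for (int p = colptr[j]; p < colptr[j + 1]; p++)
            y[rowind[p]] += val[p] * xj0;
        for (int p = colptr[j + 1]; p < colptr[j + 2]; p++)
            y[rowind[p]] += val[p] * xj1;
    }
    for (; j < n; j++)                            /* remainder column, if n is odd */
        for (int p = colptr[j]; p < colptr[j + 1]; p++)
            y[rowind[p]] += val[p] * x[j];
}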
Conference Paper
Full-text available
Large-scale integration of stochastic energy resources in power systems requires probabilistic analysis approaches for comprehensive system analysis. The widely varying grid conditions on aging and stressed power system infrastructures also require merging offline security analyses into online operation. Meanwhile, in computing, the recent rapid growth in hardware performance comes from increasingly complicated architectures, and fully utilizing the computing power for specific applications has become very difficult. Given the challenges and opportunities in both the power system and the computing fields, this paper presents unique commodity high performance computing system solutions to the following fundamental tools for power system probabilistic and security analysis: 1) a high performance Monte Carlo simulation (MCS) based distribution probabilistic load flow solver for real-time distribution feeder probabilistic solutions; 2) a high performance MCS based transmission probabilistic load flow solver for transmission grid probabilistic analysis; 3) a SIMD-accelerated AC contingency calculation solver based on the Woodbury matrix identity on multi-core CPUs. By aggressive algorithm-level and computer-architecture-level performance optimizations, including optimized data structures, optimization for superscalar out-of-order execution, SIMDization, and multi-core scheduling, our software fully utilizes modern commodity computing systems, making the critical and computationally intensive power system probabilistic and security analysis problems solvable in real time on commodity computing systems.
... The performance capabilities of modern computing platforms have been growing rapidly over the last several decades at a roughly exponential rate [8] [9]. The new mainstream multi-core CPUs and graphics cards (GPUs) enable us to build inexpensive systems with computational power similar to that of supercomputers about a decade ago [8]. ...
Conference Paper
Full-text available
Multi-core CPUs with multiple levels of parallelism and deep memory hierarchies have become the mainstream computing platform. In this paper we develop a generally applicable high performance computing framework for Monte Carlo simulation (MCS) type applications in distribution systems, taking advantage of the performance-enhancing features of multi-core CPUs. The application in this paper is to solve the probabilistic load flow (PLF) in real time, in order to cope with the uncertainties caused by the integration of renewable energy resources. By applying various performance optimizations and multi-level parallelization, the optimized MCS solver is able to achieve more than 50% of a CPU's theoretical peak performance, and the performance is scalable with the hardware parallelism. We test the MCS solver on the IEEE 37-bus test feeder using a new Intel Sandy Bridge multi-core CPU. The optimized MCS solver is able to solve millions of load flow cases within a second, enabling the real-time Monte Carlo solution of the PLF.