Fig 3. Abstraction of processor core microarchitecture.

Source publication
Article
The slowdown in technology scaling puts architectural features at the forefront of the innovation in modern processors. This article presents a Metric-Guided Method (MGM) that extends Top-Down analysis with carefully selected, dynamically adapted metrics in a structured approach. Using MGM, we conduct two evaluations at the microarchitecture and th...

Contexts in source publication

Context 1
... are three main sections in a modern processor, as depicted in Figure 3: the front-end, the execution engine, and the memory subsystem. ...
Context 2
... high-level description of the microarchitectures shown in Figure 3 applies to the designs of Intel Skylake [6], AMD Zen [5], and Qualcomm Falkor [7] that power the most competitive mainstream servers [10]. While the principles of in-order fetch and commit pipelines with out-of-order execution are common, the details vary. ...
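For concreteness, the level-1 Top-Down (TMA) breakdown that the source article builds on can be sketched as below. The event names follow Intel's Skylake documentation, and the four-slot pipeline width and the counter values are assumptions of this illustration rather than details taken from the article; in practice the counts would come from a PMU reader such as perf.

```python
# Minimal sketch of the level-1 Top-Down (TMA) breakdown, assuming
# Skylake-style event names and a 4-wide issue pipeline.
PIPELINE_WIDTH = 4  # issue slots per cycle on Skylake-class cores

def topdown_level1(counters):
    """Return the four level-1 TMA fractions from raw PMU counter values."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    bad_speculation = (counters["UOPS_ISSUED.ANY"]
                       - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                       + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {
        "Frontend Bound": frontend_bound,    # front-end (fetch/decode) section
        "Bad Speculation": bad_speculation,  # slots wasted on mispredicted paths
        "Retiring": retiring,                # slots doing useful work
        "Backend Bound": backend_bound,      # execution engine + memory subsystem
    }

# Hypothetical counter values, for illustration only.
print(topdown_level1({
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800_000,
    "UOPS_ISSUED.ANY": 2_600_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_400_000,
    "INT_MISC.RECOVERY_CYCLES": 25_000,
}))
```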

Similar publications

Article
In order to assess pollutants and the impact of environmental changes along the Egyptian Red Sea coast, seven recent and Pleistocene coral species have been analyzed for Zn, Pb, Mn, Fe, Cr, Co, Ni, and Cu. Results show that the concentrations of trace elements in recent coral skeletons are higher than those of their Pleistocene counterparts, except for Mn and Ni...
Article
Diffusion-weighted magnetic resonance imaging (DW-MRI) is a diagnostic tool that is increasingly used for the detection and characterization of focal masses in the abdomen, among these, pancreatic ductal adenocarcinoma (PDAC). DW-MRI reflects the microarchitecture of the tissue, and changes in diffusion, which are reflected by changes in the appare...
Article
It is well established that tissue macrophages and tissue-resident memory CD8⁺ T cells (TRM) play important roles in pathogen sensing and rapid protection of barrier tissues. In contrast, the mechanisms by which these two cell types cooperate for homeostatic organ surveillance after clearance of infections are poorly understood. Here, we used in...
Conference Paper
Microarchitecture-based side-channel attacks are common threats nowadays. Intel SGX technology provides strong isolation from an adversarial OS; however, it does not guarantee protection against side-channel attacks. In this paper, we analyze the security of the mbedTLS binary GCD algorithm, an implementation that offers interesting challenges when...
Article
Background: To reveal trends in bone microarchitectural parameters with increasing spatial resolution on ultra-high-resolution computed tomography (UHRCT) in vivo and to compare its performance with that of conventional-resolution CT (CRCT) and micro-CT ex vivo. Methods: We retrospectively assessed 5 tiger vertebrae ex vivo and 16 human tibiae i...

Citations

... We define the frequency scalability of a workload as the change in its performance per unit change in frequency, as in [107], [144]-[146]. ...
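As a concrete illustration only (the helper name and the relative-change normalization are assumptions of this sketch, not taken from the cited works), such a frequency-scalability figure could be computed from two operating points:

```python
def frequency_scalability(perf_low, perf_high, freq_low_ghz, freq_high_ghz):
    """Illustrative estimate: relative performance change per relative
    frequency change between two operating points."""
    dperf = (perf_high - perf_low) / perf_low
    dfreq = (freq_high_ghz - freq_low_ghz) / freq_low_ghz
    return dperf / dfreq  # ~1.0: fully frequency-scalable, ~0.0: frequency-insensitive

# Example: a workload gaining 10% performance from a 25% frequency increase
# has a scalability of ~0.4, i.e., it is largely limited by something other
# than core frequency (e.g., memory latency).
print(frequency_scalability(100.0, 110.0, 2.0, 2.5))
```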
... Note that the machine-learning method we used automatically assigns low additive-regression coefficients to less relevant performance metrics in the models, so irrelevant metrics can be automatically discarded from the models [27]. Notably, most of the metrics identified as relevant by the machine-learning engine for SF prediction on the big core are based on the TMA (Top-Down Microarchitecture Analysis) event type, which was recently introduced by Intel to aid in the fine-grained identification of application performance bottlenecks [34]. As a new feature of Intel Alder Lake processors, these TMA metrics can all be monitored together on P-cores using a single PMC [15]. ...
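The feature-pruning idea can be illustrated with a rough sketch using a gradient-boosted additive regression model from scikit-learn; the metric names, data, and importance cutoff below are hypothetical, and this is not the cited authors' actual pipeline:

```python
# Illustrative only: discard performance metrics that an additive-regression
# model deems irrelevant. Metric names, data, and the cutoff are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

metric_names = ["tma_frontend_bound", "tma_backend_bound",
                "tma_bad_speculation", "tma_retiring", "llc_mpki"]
X = np.random.rand(200, len(metric_names))  # per-sample metric values
y = np.random.rand(200)                     # e.g., measured slowdown factor (SF)

model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

# Keep only metrics whose learned importance is non-negligible.
relevant = [name for name, imp in zip(metric_names, model.feature_importances_)
            if imp > 0.01]
print("relevant metrics:", relevant)
```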
... 3) memory system. Fig. 1(a) shows the architecture used in recent Intel processors (e.g., Skylake [23,33,34], Coffee Lake [25], and Cannon Lake [26]) with a focus on CPU cores. Power Management. ...
... DarkGates' three key components are implemented within the Intel Skylake SoC [23,33,34]. ...
... These mechanisms optimize voltage guardband using hardware and/or software sensors to reduce the operating margin for energy savings. Several of these guardband-reduction mechanisms are already applied in the Skylake processor [14,23,34,72,74]. DarkGates can be applied orthogonally to these mechanisms since it physically optimizes the system impedance by bypassing the power-gates and sharing the power delivery resources on the package. ...
Preprint
To reduce the leakage power of inactive (dark) silicon components, modern processor systems shut off these components' power supply using low-leakage transistors, called power-gates. Unfortunately, power-gates increase the system's power-delivery impedance and voltage guardband, limiting the system's maximum attainable voltage (i.e., Vmax) and, thus, the CPU core's maximum attainable frequency (i.e., Fmax). As a result, systems that are performance-constrained by the CPU frequency (i.e., Fmax-constrained), such as high-end desktops, suffer significant performance loss due to power-gates. To mitigate this performance loss, we propose DarkGates, a hybrid system architecture that increases the performance of Fmax-constrained systems while fulfilling their power efficiency requirements. DarkGates is based on three key techniques: i) bypassing on-chip power-gates using package-level resources (called bypass mode), ii) extending power management firmware to support operation either in bypass mode or normal mode, and iii) introducing deeper idle power states. We implement DarkGates on an Intel Skylake microprocessor for client devices and evaluate it using a wide variety of workloads. On a real 4-core Skylake system with integrated graphics, DarkGates improves the average performance of SPEC CPU2006 workloads by 4.2% to 5.3% across all thermal design power (TDP) levels (35W-91W). DarkGates maintains the performance of 3DMark workloads for desktop systems with a TDP greater than 45W, while a 35W-TDP (the lowest TDP) desktop experiences only a 2% degradation. In addition, DarkGates fulfills the requirements of the ENERGY STAR and the Intel Ready Mode energy efficiency benchmarks for desktop systems.
... Our finding on Ivy Bridge vs. Broadwell corroborates the recent work of Yasin et al. [60] where SPEC benchmarks are evaluated across Ivy Bridge and Skylake (the generation after Broadwell) micro-architectures. Yasin et al. also show that the improvement on the Skylake micro-architecture, which inherits the improvements from the Broadwell microarchitecture, significantly reduces the Icache stalls. ...
Article
Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main memory and, therefore, can omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls, where instruction cache misses or long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted on stalls for both traditional disk-based and new-generation in-memory OLTP.
Article
Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor, but only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, state-of-the-art profiling tools commonly group events for multiplexing PMCs inefficiently, which risks inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs, but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics' sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of the Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and the ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising correctness or reliability.
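The grouping idea can be pictured with a naive sketch: pack events into groups no larger than the number of available PMCs, so that each group fits onto the counters within one multiplexing interval. The greedy chunking and the event names below are illustrative assumptions; the tool's adaptive grouping is more involved.

```python
# Naive illustration of grouping events for PMC multiplexing: each group may
# use at most `num_pmcs` counters so it can be programmed in one interval.
def group_events(events, num_pmcs):
    return [events[i:i + num_pmcs] for i in range(0, len(events), num_pmcs)]

events = ["cycles", "instructions", "branch-misses", "cache-misses",
          "L1-dcache-load-misses", "LLC-load-misses", "dTLB-load-misses"]
for group in group_events(events, num_pmcs=4):
    print(group)  # each sub-list could become one perf event group
```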
Article
Asymmetric multicore processors (AMPs) couple high-performance big cores and power-efficient small ones, all exposing a shared instruction set architecture to software, but with different microarchitectural features. The energy efficiency benefits of AMPs, together with the general-purpose nature of the various cores, have led hardware manufacturers to build commercial AMP-based products, first for the mobile and embedded domains, and more recently for the desktop market segment, as with the Intel Alder Lake processor family. This trend indicates that AMPs may become a solid and more energy-efficient replacement for symmetric multicores in a wide range of application domains. Previous research has demonstrated that the system software can substantially improve scheduling, which is critical to getting the most out of heterogeneous cores, by leveraging hardware facilities that are directly managed by the OS, such as performance monitoring counters or the recently introduced Intel Thread Director technology. Unfortunately, the OS-level support enabling access to these hardware facilities may take a long time to be adopted in operating systems, or may come in forms that make its use challenging from specific levels of the system software stack, especially in production systems. To fill this gap, we propose PMCSched, an open-source framework enabling rapid development and evaluation of custom scheduling-related support in the Linux kernel. PMCSched greatly simplifies the design and implementation of a wide range of scheduling policies for multicore systems that operate at different system software layers, without requiring the kernel to be patched. To demonstrate the potential of our framework, we conduct a set of experimental case studies on asymmetry-aware scheduling for Intel Alder Lake processors.
Chapter
The Top-Down method makes it possible to identify bottlenecks as instructions traverse the CPU's pipeline. Once bottlenecks are identified, incremental changes to the code can be made to mitigate the negative effects bottlenecks might have on performance. This is an iterative process that can result in better use of CPU resources. It can be difficult to compare bottleneck metrics of the same program generated by different compilers running on the same system: different compilers may generate different instructions, arrange the instructions in a different order, and require a different number of cycles to execute the program. Ratios with relatively similar values can hide valuable information about differences in the magnitude and influence of bottlenecks. To amplify magnitude differences in bottleneck metrics, we use the cycles required to complete the program as a reference point. We can then quantify the relative difference in the effect a bottleneck has compared with the corresponding bottleneck of the reference compiler. This study's proposed approach is based on the Purchasing Power Parity theory, which economists use to compare the purchasing power of different currencies by comparing similar products. We show that this approach can give more information on how effectively each compiler uses the CPU's architectural features by comparing their respective bottlenecks. For example, using conventional methods, our measurements show that for the 363.swim benchmark the Back-End Bound rate was 0.949 for GCC4 and 0.956 for both GCC6 and GCC7. However, using the PPP normalization approach, we show differences of 55.3% for GCC6 and 54.9% for GCC7 over GCC4.
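A worked sketch of the normalization described above, under the assumption (not stated in the chapter) that each bottleneck fraction is scaled by the program's total cycle count before being compared against the reference compiler; the cycle counts below are hypothetical and do not reproduce the reported 55.3% figure.

```python
# Hypothetical illustration of cycle-normalized bottleneck comparison.
# The Back-End Bound rates are from the chapter (363.swim); the cycle counts
# are invented to show the mechanics only.
def relative_bottleneck_diff(rate, cycles, ref_rate, ref_cycles):
    """Compare a bottleneck's absolute cycle cost against a reference compiler."""
    return (rate * cycles - ref_rate * ref_cycles) / (ref_rate * ref_cycles)

gcc4 = {"backend_bound": 0.949, "cycles": 1.00e12}  # reference compiler
gcc6 = {"backend_bound": 0.956, "cycles": 1.50e12}  # hypothetical cycle count

print(relative_bottleneck_diff(gcc6["backend_bound"], gcc6["cycles"],
                               gcc4["backend_bound"], gcc4["cycles"]))  # ~0.51
```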