Fig 3. Abstraction of processor core microarchitecture.

Source publication
Article
The slowdown in technology scaling puts architectural features at the forefront of the innovation in modern processors. This article presents a Metric-Guided Method (MGM) that extends Top-Down analysis with carefully selected, dynamically adapted metrics in a structured approach. Using MGM, we conduct two evaluations at the microarchitecture and th...

Contexts in source publication

Context 1
... are three main sections in a modern processor, as depicted in Figure 3: the front-end, the execution engine, and the memory subsystem. ...
Context 2
... high-level description of the microarchitectures shown in Figure 3 applies to the designs of Intel Skylake [6], AMD Zen [5], and Qualcomm Falkor [7] that power the most competitive mainstream servers [10]. While the principles of in-order fetch and commit pipelines with out-of-order execution are common, the details vary. ...
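For concreteness, the level-1 Top-Down (TMA) breakdown that the source article builds on can be sketched as below. The event names follow Intel's Skylake documentation, and the four-slot pipeline width and the counter values are assumptions of this illustration rather than details taken from the article; in practice the counts would come from a PMU reader such as perf.

```python
# Minimal sketch of the level-1 Top-Down (TMA) breakdown, assuming
# Skylake-style event names and a 4-wide issue pipeline.
PIPELINE_WIDTH = 4  # issue slots per cycle on Skylake-class cores

def topdown_level1(counters):
    """Return the four level-1 TMA fractions from raw PMU counter values."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    bad_speculation = (counters["UOPS_ISSUED.ANY"]
                       - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                       + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {
        "Frontend Bound": frontend_bound,    # front-end (fetch/decode) section
        "Bad Speculation": bad_speculation,  # slots wasted on mispredicted paths
        "Retiring": retiring,                # slots doing useful work
        "Backend Bound": backend_bound,      # execution engine + memory subsystem
    }

# Hypothetical counter values, for illustration only.
print(topdown_level1({
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800_000,
    "UOPS_ISSUED.ANY": 2_600_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_400_000,
    "INT_MISC.RECOVERY_CYCLES": 25_000,
}))
```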

Similar publications

Article
In order to assess pollutants and the impact of environmental changes along the Egyptian Red Sea coast, seven recent and Pleistocene coral species have been analyzed for Zn, Pb, Mn, Fe, Cr, Co, Ni, and Cu. Results show that the concentrations of trace elements in recent coral skeletons are higher than those of their Pleistocene counterparts, except for Mn and Ni...
Article
Diffusion-weighted magnetic resonance imaging (DW-MRI) is a diagnostic tool that is increasingly used for the detection and characterization of focal masses in the abdomen, among these, pancreatic ductal adenocarcinoma (PDAC). DW-MRI reflects the microarchitecture of the tissue, and changes in diffusion, which are reflected by changes in the appare...
Article
It is well established that tissue macrophages and tissue-resident memory CD8⁺ T cells (TRM) play important roles in pathogen sensing and rapid protection of barrier tissues. In contrast, the mechanisms by which these two cell types cooperate for homeostatic organ surveillance after clearance of infections are poorly understood. Here, we used in...
Conference Paper
Microarchitecture-based side-channel attacks are common threats nowadays. Intel SGX technology provides strong isolation from an adversarial OS; however, it does not guarantee protection against side-channel attacks. In this paper, we analyze the security of the mbedTLS binary GCD algorithm, an implementation that offers interesting challenges when...
Article
Background: To reveal trends in bone microarchitectural parameters with increasing spatial resolution on ultra-high-resolution computed tomography (UHRCT) in vivo and to compare its performance with that of conventional-resolution CT (CRCT) and micro-CT ex vivo. Methods: We retrospectively assessed 5 tiger vertebrae ex vivo and 16 human tibiae i...

Citations

... We define the frequency scalability of a workload as the change in its performance per unit change in frequency, as in [107], [144]-[146]. ...
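As a concrete illustration only (the helper name and the relative-change normalization are assumptions of this sketch, not taken from the cited works), such a frequency-scalability figure could be computed from two operating points:

```python
def frequency_scalability(perf_low, perf_high, freq_low_ghz, freq_high_ghz):
    """Illustrative estimate: relative performance change per relative
    frequency change between two operating points."""
    dperf = (perf_high - perf_low) / perf_low
    dfreq = (freq_high_ghz - freq_low_ghz) / freq_low_ghz
    return dperf / dfreq  # ~1.0: fully frequency-scalable, ~0.0: frequency-insensitive

# Example: a workload gaining 10% performance from a 25% frequency increase
# has a scalability of ~0.4, i.e., it is largely limited by something other
# than core frequency (e.g., memory latency).
print(frequency_scalability(100.0, 110.0, 2.0, 2.5))
```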
... Note that the machine-learning method we used automatically assigns low additive-regression coefficients to less relevant performance metrics in the models, so irrelevant metrics can be automatically discarded from the models [27]. Notably, most of the metrics identified as relevant by the machine-learning engine for SF prediction on the big core are based on the TMA (Top-Down Microarchitecture Analysis) event type, which was recently introduced by Intel to aid in the fine-grained identification of application performance bottlenecks [34]. As a new feature of Intel Alder Lake processors, these TMA metrics can all be monitored together on P-cores using a single PMC [15]. ...
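The feature-pruning idea can be illustrated with a rough sketch using a gradient-boosted additive regression model from scikit-learn; the metric names, data, and importance cutoff below are hypothetical, and this is not the cited authors' actual pipeline:

```python
# Illustrative only: discard performance metrics that an additive-regression
# model deems irrelevant. Metric names, data, and the cutoff are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

metric_names = ["tma_frontend_bound", "tma_backend_bound",
                "tma_bad_speculation", "tma_retiring", "llc_mpki"]
X = np.random.rand(200, len(metric_names))  # per-sample metric values
y = np.random.rand(200)                     # e.g., measured slowdown factor (SF)

model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

# Keep only metrics whose learned importance is non-negligible.
relevant = [name for name, imp in zip(metric_names, model.feature_importances_)
            if imp > 0.01]
print("relevant metrics:", relevant)
```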
... 3) memory system. Fig. 1(a) shows the architecture used in recent Intel processors (e.g., Skylake [23,33,34], Coffee Lake [25], and Cannon Lake [26]) with a focus on CPU cores. Power Management. ...
... DarkGates' three key components are implemented within the Intel Skylake SoC [23,33,34]. ...
... These mechanisms optimize voltage guardband using hardware and/or software sensors to reduce the operating margin for energy savings. Several of these guardband-reduction mechanisms are already applied in the Skylake processor [14,23,34,72,74]. DarkGates can be applied orthogonally to these mechanisms since it physically optimizes the system impedance by bypassing the power-gates and sharing the power delivery resources on the package. ...
Preprint
To reduce the leakage power of inactive (dark) silicon components, modern processor systems shut off these components' power supply using low-leakage transistors, called power-gates. Unfortunately, power-gates increase the system's power-delivery impedance and voltage guardband, limiting the system's maximum attainable voltage (i.e., Vmax) and, thus, the CPU core's maximum attainable frequency (i.e., Fmax). As a result, systems that are performance-constrained by the CPU frequency (i.e., Fmax-constrained), such as high-end desktops, suffer significant performance loss due to power-gates. To mitigate this performance loss, we propose DarkGates, a hybrid system architecture that increases the performance of Fmax-constrained systems while fulfilling their power efficiency requirements. DarkGates is based on three key techniques: i) bypassing on-chip power-gates using package-level resources (called bypass mode), ii) extending power management firmware to support operation either in bypass mode or normal mode, and iii) introducing deeper idle power states. We implement DarkGates on an Intel Skylake microprocessor for client devices and evaluate it using a wide variety of workloads. On a real 4-core Skylake system with integrated graphics, DarkGates improves the average performance of SPEC CPU2006 workloads by 4.2% to 5.3% across all thermal design power (TDP) levels (35W-91W). DarkGates maintains the performance of 3DMark workloads for desktop systems with a TDP greater than 45W, while a 35W-TDP (the lowest TDP) desktop experiences only a 2% degradation. In addition, DarkGates fulfills the requirements of the ENERGY STAR and the Intel Ready Mode energy efficiency benchmarks for desktop systems.
... Our finding on Ivy Bridge vs. Broadwell corroborates the recent work of Yasin et al. [60] where SPEC benchmarks are evaluated across Ivy Bridge and Skylake (the generation after Broadwell) micro-architectures. Yasin et al. also show that the improvement on the Skylake micro-architecture, which inherits the improvements from the Broadwell microarchitecture, significantly reduces the Icache stalls. ...
Article
Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main memory and, therefore, can omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls, where instruction cache misses or long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted on stalls for both traditional disk-based and new-generation in-memory OLTP.
Article
Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor, but only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, state-of-the-art profiling tools commonly group events for multiplexing PMCs inefficiently, which risks inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs, but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics' sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of the Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and the ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising correctness or reliability.
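The grouping idea can be pictured with a naive sketch: pack events into groups no larger than the number of available PMCs, so that each group fits onto the counters within one multiplexing interval. The greedy chunking and the event names below are illustrative assumptions; the tool's adaptive grouping is more involved.

```python
# Naive illustration of grouping events for PMC multiplexing: each group may
# use at most `num_pmcs` counters so it can be programmed in one interval.
def group_events(events, num_pmcs):
    return [events[i:i + num_pmcs] for i in range(0, len(events), num_pmcs)]

events = ["cycles", "instructions", "branch-misses", "cache-misses",
          "L1-dcache-load-misses", "LLC-load-misses", "dTLB-load-misses"]
for group in group_events(events, num_pmcs=4):
    print(group)  # each sub-list could become one perf event group
```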
Article
Asymmetric multicore processors (AMPs) couple high-performance big cores and power-efficient small ones, all exposing a shared instruction set architecture to software, but with different microarchitectural features. The energy efficiency benefits of AMPs, together with the general-purpose nature of the various cores, have led hardware manufacturers to build commercial AMP-based products, first for the mobile and embedded domains, and more recently for the desktop market segment, as with the Intel Alder Lake processor family. This trend indicates that AMPs may become a solid and more energy-efficient replacement for symmetric multicores in a wide range of application domains. Previous research has demonstrated that the system software can substantially improve scheduling, which is critical to getting the most out of heterogeneous cores, by leveraging hardware facilities that are directly managed by the OS, such as performance monitoring counters or the recently introduced Intel Thread Director technology. Unfortunately, the OS-level support enabling access to these hardware facilities may take a long time to be adopted in operating systems, or may come in forms that make its use challenging from specific levels of the system software stack, especially in production systems. To fill this gap, we propose PMCSched, an open-source framework enabling rapid development and evaluation of custom scheduling-related support in the Linux kernel. PMCSched greatly simplifies the design and implementation of a wide range of scheduling policies for multicore systems that operate at different system software layers, without requiring the kernel to be patched. To demonstrate the potential of our framework, we conduct a set of experimental case studies on asymmetry-aware scheduling for Intel Alder Lake processors.
Chapter
The Top-Down method makes it possible to identify bottlenecks as instructions traverse the CPU's pipeline. Once bottlenecks are identified, incremental changes to the code can be made to mitigate the negative effects bottlenecks might have on performance. This is an iterative process that can result in better use of CPU resources. It can be difficult to compare bottleneck metrics of the same program generated by different compilers running on the same system: different compilers may generate different instructions, arrange the instructions in a different order, and require a different number of cycles to execute the program. Ratios with relatively similar values can hide valuable information about differences in the magnitude and influence of bottlenecks. To amplify magnitude differences in bottleneck metrics, we use the cycles required to complete the program as a reference point. We can then quantify the relative difference in the effect a bottleneck has compared with the corresponding bottleneck of the reference compiler. This study's proposed approach is based on the Purchasing Power Parity theory, which economists use to compare the purchasing power of different currencies by comparing similar products. We show that this approach can give more information on how effectively each compiler uses the CPU's architectural features by comparing their respective bottlenecks. For example, using conventional methods, our measurements show that for the 363.swim benchmark the Back-End Bound rate was 0.949 for GCC4 and 0.956 for both GCC6 and GCC7. However, using the PPP normalization approach, we show differences of 55.3% for GCC6 and 54.9% for GCC7 over GCC4.
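A worked sketch of the normalization described above, under the assumption (not stated in the chapter) that each bottleneck fraction is scaled by the program's total cycle count before being compared against the reference compiler; the cycle counts below are hypothetical and do not reproduce the reported 55.3% figure.

```python
# Hypothetical illustration of cycle-normalized bottleneck comparison.
# The Back-End Bound rates are from the chapter (363.swim); the cycle counts
# are invented to show the mechanics only.
def relative_bottleneck_diff(rate, cycles, ref_rate, ref_cycles):
    """Compare a bottleneck's absolute cycle cost against a reference compiler."""
    return (rate * cycles - ref_rate * ref_cycles) / (ref_rate * ref_cycles)

gcc4 = {"backend_bound": 0.949, "cycles": 1.00e12}  # reference compiler
gcc6 = {"backend_bound": 0.956, "cycles": 1.50e12}  # hypothetical cycle count

print(relative_bottleneck_diff(gcc6["backend_bound"], gcc6["cycles"],
                               gcc4["backend_bound"], gcc4["cycles"]))  # ~0.51
```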