Table 1. Specification of the AMD and Intel processors.

Source publication
Conference Paper
Full-text available
Most multi-core architectures nowadays support dynamic voltage and frequency scaling (DVFS) to adapt their speed to the system’s load and save energy. Some recent architectures additionally allow cores to operate at boosted speeds exceeding the nominal base frequency but within their thermal design power. In this paper, we propose a general-purpos...

Citations

... C-states are managed by the CPUIdle kernel subsystem and cannot be configured in user space. ii) Operating Performance Points (so-called "P-states") implement different dynamic voltage and frequency scaling (DVFS) configurations [57]. Adapting the P-states is currently the standard approach to perform power management when the CPU is busy processing network packets. ...
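As a minimal illustration of this standard approach, the sketch below shows how a user-space program can inspect and, with root privileges, change the cpufreq governor that drives P-state selection. It assumes the standard Linux cpufreq sysfs layout under /sys/devices/system/cpu/cpu0/cpufreq/; it is a sketch, not a recommended power-management tool.

/* Hedged sketch: inspecting and setting the cpufreq governor ("P-state
 * policy") for CPU 0 through the Linux sysfs interface. Writing the
 * governor file requires root privileges. */
#include <stdio.h>

#define CPUFREQ "/sys/devices/system/cpu/cpu0/cpufreq/"

static void show(const char *file) {
    char path[256], buf[256];
    snprintf(path, sizeof path, CPUFREQ "%s", file);
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%s: %s", file, buf);
    if (f) fclose(f);
}

int main(void) {
    show("scaling_available_governors"); /* e.g. "performance powersave" */
    show("scaling_governor");            /* currently active governor    */

    /* Switch to the "performance" governor (needs root). */
    FILE *f = fopen(CPUFREQ "scaling_governor", "w");
    if (f) { fputs("performance\n", f); fclose(f); }
    else   { perror("scaling_governor"); }

    show("scaling_governor");
    return 0;
}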
Article
Emerging microservices demand flexible low-latency processing of network functions in virtualized environments, e.g., as containerized network functions (CNFs). While ensuring highly responsive low-latency CNF processing, the computing environments should conserve energy to reduce costs. In this systems integration study, we develop and evaluate the novel XDP-Monitoring Energy-Adaptive Network Functions (X-MAN) framework for managing the CPU operational states (P-states) so as to reduce power consumption while prioritizing low-latency service. Architecturally, X-MAN consists of lightweight traffic monitors that are attached to the virtual network interfaces in kernel space for per-CNF traffic monitoring, and a power manager in user space with a global view of the CNFs on a CPU core. Algorithmically, X-MAN monitors the CPU core utilization via a hybrid simple and weighted moving average prediction fed by the traffic monitors, and performs power management through step-based CPU core frequency (P-state) adjustments. We evaluate X-MAN through extensive measurements in a real physical testbed operating at up to 10 Gbps. We find that X-MAN incurs significantly shorter and more consistent monitoring latencies for the CPU utilization than a state-of-the-art CPU hardware counter approach. Also, X-MAN achieves more responsive CPU core frequency adjustments and more pronounced reductions of the CPU power consumption than a state-of-the-art code instrumentation approach. We make the X-MAN source code publicly available.
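X-MAN's exact parameters are not given in this abstract; the sketch below only illustrates the general idea of a hybrid simple/weighted moving average feeding step-based frequency adjustments. The window size, thresholds, frequency table, and the 50/50 blend are assumptions, not the paper's values.

/* Hedged sketch of a step-based frequency controller in the spirit of
 * X-MAN: a hybrid of a simple and a weighted moving average over CPU
 * utilization samples drives one-step frequency-level adjustments. */
#include <stdio.h>

#define WIN 4                    /* sample window (assumption) */

static const int freqs_khz[] = { 800000, 1400000, 2000000, 2600000, 3200000 };
static const int nfreqs = sizeof freqs_khz / sizeof freqs_khz[0];

/* Simple moving average over the last WIN samples. */
static double sma(const double *u) {
    double s = 0;
    for (int i = 0; i < WIN; i++) s += u[i];
    return s / WIN;
}

/* Weighted moving average: newer samples weigh more. */
static double wma(const double *u) {
    double s = 0, w = 0;
    for (int i = 0; i < WIN; i++) { s += (i + 1) * u[i]; w += i + 1; }
    return s / w;
}

int main(void) {
    /* Synthetic utilization trace in [0,1], oldest to newest; in X-MAN
     * such values come from the in-kernel XDP traffic monitors. */
    double trace[][WIN] = {
        { 0.20, 0.30, 0.45, 0.60 },
        { 0.60, 0.75, 0.85, 0.90 },
        { 0.50, 0.40, 0.25, 0.15 },
    };
    int level = nfreqs / 2;      /* start mid-range */

    for (int t = 0; t < 3; t++) {
        double util = 0.5 * sma(trace[t]) + 0.5 * wma(trace[t]);
        if (util > 0.80 && level < nfreqs - 1) level++;  /* step up   */
        if (util < 0.30 && level > 0)          level--;  /* step down */
        printf("t=%d util=%.2f -> %d kHz\n", t, util, freqs_khz[level]);
    }
    return 0;
}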
... Performance can also be asymmetric in SMP when using dynamic voltage and frequency scaling (DVFS). However, SMP systems can boost the frequency of the lagging core [21,25,57], while AMP cannot. ...
... Computing capacity can also be asymmetric in SMP when using DVFS. Previous work [21,25,57] boosts the frequency of the lock holder to gain better throughput. However, unlike DVFS, the asymmetry in AMP is inherent. ...
Preprint
The pursuit of power-efficiency is popularizing asymmetric multicore processors (AMP) such as ARM big.LITTLE, Apple M1 and recent Intel Alder Lake with big and little cores. However, we find that existing scalable locks fail to scale on AMP and cause collapses in either throughput or latency, or both, because their implicit assumption of symmetric cores no longer holds. To address this issue, we propose the first asymmetry-aware scalable lock named LibASL. LibASL provides a new lock ordering guided by applications' latency requirements, which allows big cores to reorder with little cores for higher throughput under the condition of preserving applications' latency requirements. Using LibASL only requires linking the applications with it and, if latency-critical, inserting a few lines of code to annotate the coarse-grained latency requirement. We evaluate LibASL in various benchmarks including five popular databases on Apple M1. Evaluation results show that LibASL can improve the throughput by up to 5 times while precisely preserving the tail latency designated by applications.
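The abstract does not show LibASL's actual API; the sketch below only illustrates the kind of coarse-grained latency annotation it describes. The function name is a hypothetical placeholder, stubbed out so the sketch compiles, and is not LibASL's real interface.

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical annotation call (NOT LibASL's real API): cap the extra
 * lock-acquisition latency this thread may tolerate due to reordering
 * in favor of big cores. Stubbed out here so the sketch runs. */
static void asl_set_latency_budget_us(long us) {
    printf("latency budget: %ld us\n", us);
}

void latency_critical_worker(void) {
    asl_set_latency_budget_us(500);   /* annotate once, coarse-grained  */
    pthread_mutex_lock(&m);           /* acquired through the library,  */
    /* ... latency-critical section ... which bounds the reordering     */
    pthread_mutex_unlock(&m);
}

int main(void) { latency_critical_worker(); return 0; }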
... In Linux, the CPUFreq kernel driver [27] will choose a set of OPPs based on a specified governor. DVFS has been extensively studied [25,28,29] to accelerate multi-threaded applications. x86 manufacturers use their own DVFS implementations [30,31,32]. ...
Preprint
Full-text available
The latest ARM processors are approaching the computational power of x86 architectures while consuming much less energy. Consequently, supply follows demand with Amazon EC2, Equinix Metal and Microsoft Azure offering ARM-based instances, while Oracle Cloud Infrastructure is about to add such support. We expect this trend to continue, with an increasing number of cloud providers offering ARM-based cloud instances. ARM processors are more energy-efficient, leading to substantial electricity savings for cloud providers. However, a malicious cloud provider could intentionally reduce the CPU voltage to further lower its costs. Running applications malfunction when the undervolting goes below critical thresholds. By avoiding critical voltage regions, a cloud provider can run undervolted instances in a stealthy manner. This practical experience report describes a novel attack scenario: an attack launched by the cloud provider against its users to aggressively reduce the processor voltage for saving energy to the last penny. We call it the Scrooge Attack and show how it could be executed using ARM-based computing instances. We mimic ARM-based cloud instances by deploying our own ARM-based devices using different generations of Raspberry Pi. Using realistic and synthetic workloads, we demonstrate up to which degree of aggressiveness the attack remains practical. The attack is unnoticeable by our detection method up to an offset of -50mV. We show that the attack may even remain completely stealthy for certain workloads. Finally, we propose a set of client-based detection methods that can identify undervolted instances. We support experimental reproducibility and provide instructions to reproduce our results.
... Boosting techniques have been used to speed up the execution of serial and parallel codes by activating/deactivating the boosting frequencies according to the application characteristics [3], [5]-[7], as shown in Fig. 1. While they are usually turned on to further accelerate the execution of CPU-intensive regions, boosting frequencies are deactivated for memory-intensive regions to save energy, as there will be low CPU usage (Fig. 1b). ...
... When boosting is turned off, the P1 state is set, which represents the maximum base frequency without interference from any boosting technique. Conversely, the higher the value of Pn, the lower the frequency and voltage level [3]. AMD and Intel differ w.r.t. the combinations of frequencies and cores that can be used, as depicted in Table I for the architectures used in this work. ...
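As a hedged illustration of how entry into the boost states (P0) can be toggled from user space on Linux: which sysfs knob exists depends on the driver (intel_pstate exposes "no_turbo", acpi-cpufreq exposes a "boost" file), and writing either requires root.

/* Hedged sketch: disabling the boost frequencies (P0) from user space
 * by trying the driver-specific sysfs knobs in turn. */
#include <stdio.h>

static int write_knob(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void) {
    /* intel_pstate: "1" means turbo off; acpi-cpufreq: "0" means boost off. */
    if (write_knob("/sys/devices/system/cpu/intel_pstate/no_turbo", "1") == 0)
        puts("turbo disabled via intel_pstate");
    else if (write_knob("/sys/devices/system/cpu/cpufreq/boost", "0") == 0)
        puts("boost disabled via acpi-cpufreq");
    else
        perror("no writable boost knob found");
    return 0;
}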
... Autoturbo [16] predicts the characteristics of the application to modify the boosting at run-time. Wamhoff et al. [3] propose a model to guide software designers in using the boosting technique. Maiti et al. [6] propose a framework that improves energy and performance using the boosting feature along with thread mapping. ...
Conference Paper
With the increasing number of cores in modern systems, dynamic concurrency throttling (DCT) and turbo-boosting techniques are becoming a solution to better use the hardware resources. While DCT techniques tune the number of running threads, boosting techniques speed up sequential phases or unbalanced threads. However, as each region of an application may behave differently, optimizing both knobs is not straightforward. Hence, we propose two strategies that apply DCT and turbo-boosting: DBF, which aims to find an ideal configuration for each parallel/sequential region, and DBC, which considers the combination of parallel/sequential regions during the optimization. We show that DBF and DBC improve the EDP by up to 19% and 27% compared to a DCT-only strategy and by up to 95% and 96% compared to a Boost-only technique. We also show that DBF is more suitable for applications with high variability in the CPU workload, while DBC is better when there is low workload variability.
... Additionally, boosting techniques (e.g., Turbo Boost and Turbo CORE) also apply DVFS, but to maximize processor performance by increasing the operating frequency of a group of cores, while the remaining cores perform at a lower frequency or are turned off [1]. These technologies have been widely used to improve the performance of applications from different domains [1], [5], [6], which contain sequential and parallel phases. In sequential phases (e.g., critical sections), as only one core is being used and is on the critical path of the execution, the processor boosts the frequency of this core while lowering the operating frequency of the other cores. ...
... However, DVFS may also be applied in the opposite direction by using the so-called boosting techniques (e.g., Intel's Turbo Boost and AMD's Turbo CORE). In this mode, the operating frequency of a group of cores dynamically increases to maximize performance, while the remaining cores run at a lower operating frequency or are powered down [2], [3], respecting the limits of the thermal design power (TDP). ...
... We compare the results achieved by Poseidon to the following scenarios: Baseline: execution with the maximum number of threads available in the system and with boosting set to off; Boost-Only: the same as the baseline, but with boosting on; DCT-Boost-Off: the learning algorithm employed by Poseidon is used to find the ideal degree of TLP for each parallel region, but with boosting off; DCT-Boost-On: the same as the previous one, but with boosting always on; Exhaustive Search Solution: the execution of each application with the optimal number of threads and boosting mode, obtained by previously executing each application with all possible combinations. [Footnote 3: Available at: https://github.com/LLNL/] ...
Conference Paper
The increasing use of cloud and HPC systems puts more pressure on the efficient utilization of hardware resources to keep costs low. Many dynamic concurrency throttling (DCT) techniques have been used successfully to tune the number of executing threads and better balance a parallel application according to its available scalability. Similarly, frequency-boosting strategies have been used to speed up the execution of sequential parts. Given that, we propose Poseidon, the first transparent and automatic approach that cooperatively exploits both techniques to rebalance OpenMP applications without any preprocessing, code transformation, recompilation, or OS modification.
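Poseidon itself requires no source changes; purely for illustration, the sketch below makes explicit the knob a DCT runtime tunes, the per-region OpenMP thread count. The workload and the power-of-two search loop are assumptions for the example (compile with gcc -fopenmp).

/* Hedged sketch of dynamic concurrency throttling: run the same
 * parallel region with different thread counts and time each one. */
#include <omp.h>
#include <stdio.h>

static double work(int nthreads) {
    double sum = 0.0;
    omp_set_num_threads(nthreads);   /* throttle concurrency */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 10000000; i++)
        sum += 1.0 / i;
    return sum;
}

int main(void) {
    /* A DCT runtime would time each configuration and keep the best. */
    for (int n = 1; n <= omp_get_max_threads(); n *= 2) {
        double t0 = omp_get_wtime();
        double s = work(n);
        printf("%2d threads: %.3fs (sum=%.6f)\n",
               n, omp_get_wtime() - t0, s);
    }
    return 0;
}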
... This high frequency-switching rate is only viable because Turbo Boost is integrated into the CPU and is not controlled by the operating system. The operating system can only activate or deactivate Turbo Boost by setting the P-states [36]. If P-state P0 is set, the boost frequencies are activated. ...
Preprint
Covert channels are communication channels used by attackers to transmit information from a compromised system when the access control policy of the system does not allow doing so. Previous work has shown that CPU frequency scaling can be used as a covert channel to transmit information between otherwise isolated processes. Modern systems either try to save power or try to operate near their power limits in order to maximize performance, so they implement mechanisms to vary the frequency based on load. Existing covert channels based on this approach are either easily thwarted by software countermeasures or only work on completely idle systems. In this paper, we show how the automatic frequency scaling provided by Intel Turbo Boost can be used to construct a covert channel that is both hard to prevent without significant performance impact and can tolerate significant background system load. As Intel Turbo Boost selects the maximum CPU frequency based on the number of active cores, our covert channel modulates information onto the maximum CPU frequency by placing load on multiple additional CPU cores. Our prototype of the covert channel achieves a throughput of up to 61 bit/s on an idle system and up to 43 bit/s on a system with 25% utilization.
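The paper's calibrated parameters are not reproduced in this abstract; the receiver-side sketch below only conveys the principle of demodulating bits from the core-count-dependent Turbo Boost ceiling. The threshold and symbol period are assumptions, and the frequency is read from the standard cpufreq sysfs file.

/* Hedged sketch of a frequency covert-channel receiver: sample the
 * current core frequency and decode one bit by thresholding, assuming
 * the sender's load on additional cores lowers the boost ceiling. */
#include <stdio.h>
#include <unistd.h>

static long read_freq_khz(void) {
    long khz = -1;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (f) {
        if (fscanf(f, "%ld", &khz) != 1) khz = -1;
        fclose(f);
    }
    return khz;
}

int main(void) {
    const long threshold_khz = 3500000;  /* assumed boost cutoff        */
    for (int i = 0; i < 10; i++) {       /* sample 10 symbols           */
        long khz = read_freq_khz();
        int bit = khz >= 0 && khz < threshold_khz; /* lowered ceiling -> '1' */
        printf("f=%ld kHz -> bit %d\n", khz, bit);
        usleep(100000);                  /* 100 ms symbol period (assumed) */
    }
    return 0;
}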
... Such technologies have been widely used to improve the performance of parallel applications from different domains [2], [5]-[8], which naturally contain parallel and sequential phases. In sequential phases (e.g., critical sections), as only one core is being used and is on the critical path of execution, the processor usually boosts the frequency of this core while lowering the operating frequency of the other, waiting cores or cutting off their power supply. ...
... Esmaeilzadeh et al. [14] show that Turbo Boost is not energy efficient on the Intel Core i7. Wamhoff et al. [5] show that turbo frequencies can optimize the performance of applications that process a workload asymmetrically, by speeding up the cores with more workload. ...
... Furthermore, some works propose modifications to the boosting feature to optimize the behavior of applications. Wamhoff et al. [5] present a library to manually control boosting on AMD and Intel processors. Raghavan et al. [18] propose computational sprinting, which boosts the frequency and voltage to power levels that exceed the TDP for a few seconds in order to deliver better performance to end users. ...
Conference Paper
Technologies that improve the performance of parallel applications by increasing the nominal operating frequency of processors while respecting a given TDP (thermal design power) have been widely used. However, they may impact other non-functional requirements in different ways (e.g., increasing energy consumption or aging). Therefore, given the huge number of available configurations, spanning all possible combinations of parallel applications, number of threads, dynamic voltage and frequency scaling (DVFS) governors, boosting technologies, and simultaneous multithreading (SMT), selecting the one that offers the best trade-off for a non-functional requirement is extremely challenging for software designers. In this work, we therefore assess the impact of changing these configurations on the energy consumption, performance, and aging of parallel applications on a turbo-compliant processor. Results show that no single configuration provides the best solution for all non-functional requirements at once. For instance, we demonstrate that the configuration that offers the best performance is the same one that has the worst impact on aging, accelerating it by up to 1.75 times. With our experiments, we provide guidelines for developers when it comes to tuning performance with turbo boosting while saving as much energy as possible and increasing the lifespan of the hardware components.
... For example, consider a PM governor that dynamically adjusts the V/F based on the utilization of a physical core (e.g., the ondemand governor). Such a governor often fails to determine the V/F appropriate for consolidated VMs because it uses the overall utilization of the physical core sampled over a substantial period instead of that of individual VMs; the minimum sampling period set by Linux and Xen is 10 ms due to the various overheads caused by changing the V/F [31]. When heterogeneous VMs share a physical core, a CPU-intensive VM may run at a low V/F influenced by a less CPU-consuming VM that previously ran on the core, or vice versa. ...
... Application-level frequency control. The TURBO Diaries [31] proposes a library that enables frequency control from user space by modifying the applications' source code. Thus, it requires code-level knowledge of the applications to control frequency effectively. ...
Conference Paper
A power management policy aims to improve energy efficiency by choosing an appropriate performance (voltage/frequency) state for a given core. In current virtualized environments, multiple virtual machines (VMs) running on the same core must follow a single power management policy governed by the hypervisor. However, we observe that such a per-core power management policy has two limitations. First, it cannot offer the flexibility of choosing a desirable power management policy for each VM (or client). Second, it often hurts the power efficiency of some or even all VMs, especially when the VMs desire conflicting power management policies. To tackle these limitations, we propose VIP, a per-VM power management mechanism supporting a virtual performance state for each VM. Specifically, for VMs sharing a core, VIP allows each VM's guest OS to deploy its own desired power management policy while preventing such VMs from interfering with each other's power management policy. VIP can thus also facilitate a pricing model based on the choice of a power management policy. Furthermore, identifying some inefficiency in strictly enforcing per-VM power management policies, we propose hypervisor-assisted techniques to further improve power and energy efficiency without compromising the key benefits of per-VM power management. To demonstrate the efficacy of VIP, we consider a case where some VMs run CPU-intensive applications and other VMs run latency-sensitive applications on the same cores. Our evaluation shows that VIP reduces the overall energy consumption and improves the execution time of CPU-intensive applications compared with the default ondemand governor of the Xen hypervisor by up to 27% and 32%, respectively, without violating the service level agreement (SLA) of the latency-sensitive applications.
... This modification of the WCETs leads to an increase in the (nominal) clock frequency of core C during the execution of certain tasks. Similarly to [94], we assume a nominal frequency F_n = 2 GHz with a frequency step of 100 MHz, with: ...
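The equation this excerpt introduces is cut off in the extract. Purely as an assumption about its likely form, the selectable frequencies would constitute a discrete set stepped in 100 MHz increments above the nominal frequency:

% Hypothetical reconstruction (assumption), not the thesis's equation:
\[
  \mathcal{F} = \{\, F_n + k \cdot \Delta f \mid k = 0, 1, \dots, k_{\max} \,\},
  \qquad F_n = 2\ \mathrm{GHz}, \quad \Delta f = 100\ \mathrm{MHz}.
\]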
Thesis
The design of embedded real-time systems is expanding with the growing integration of critical functionalities into monitoring applications, notably in the biomedical, environmental, and home-automation domains. The development of these systems must address various challenges in minimizing energy consumption. Managing such fully autonomous embedded devices, however, requires solving several problems related to the amount of energy available in the battery, the real-time scheduling of tasks that must complete before their deadlines, reconfiguration scenarios, particularly when tasks are added, and the communication constraint of ensuring message exchange between processors, so as to guarantee lasting autonomy until the next recharge while maintaining an acceptable level of quality of service of the processing system. To address this problem, we propose in this work a task placement and scheduling strategy for executing real-time applications on an architecture with heterogeneous cores. In this thesis, we chose to tackle the problem incrementally, progressively handling the issues related to real-time, energy, and communication constraints. First, we focus on task scheduling on a single-core architecture. We propose a scheduling strategy based on grouping tasks into packs, so that the new task parameters can easily be computed to restore the system's feasibility. We then extend it to scheduling on a homogeneous multi-core architecture. Finally, this is extended further to reach the main objective: task scheduling for heterogeneous architectures. The idea is to progressively take increasingly complex execution constraints into account. We formalize all the problems as ILP formulations in order to produce optimal results, so that our proposed solutions can be positioned against the optimal solutions produced by a solver and against other state-of-the-art algorithms. Moreover, the validation by simulation of the proposed strategies shows that they yield appreciable gains with respect to the criteria considered important in embedded systems, notably the cost of inter-core communication and the task rejection rate.