Table 1. Specification of the AMD and Intel processors.

Source publication
Conference Paper
Full-text available
Most multi-core architectures nowadays support dynamic voltage and frequency scaling (DVFS) to adapt their speed to the system’s load and save energy. Some recent architectures additionally allow cores to operate at boosted speeds exceeding the nominal base frequency but within their thermal design power. In this paper, we propose a general-purpos...

Citations

... C-states are managed by the CPUIdle kernel subsystem and cannot be configured in user space. ii) Operating Performance Points (so-called "P-states") implement different dynamic voltage and frequency scaling (DVFS) configurations [57]. Adapting the P-states is currently the standard approach to perform power management when the CPU is busy processing network packets. ...
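As a minimal illustration of this standard approach, the sketch below shows how a user-space program can inspect and, with root privileges, change the cpufreq governor that drives P-state selection. It assumes the standard Linux cpufreq sysfs layout under /sys/devices/system/cpu/cpu0/cpufreq/; it is a sketch, not a recommended power-management tool.

/* Hedged sketch: inspecting and setting the cpufreq governor ("P-state
 * policy") for CPU 0 through the Linux sysfs interface. Writing the
 * governor file requires root privileges. */
#include <stdio.h>

#define CPUFREQ "/sys/devices/system/cpu/cpu0/cpufreq/"

static void show(const char *file) {
    char path[256], buf[256];
    snprintf(path, sizeof path, CPUFREQ "%s", file);
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%s: %s", file, buf);
    if (f) fclose(f);
}

int main(void) {
    show("scaling_available_governors"); /* e.g. "performance powersave" */
    show("scaling_governor");            /* currently active governor    */

    /* Switch to the "performance" governor (needs root). */
    FILE *f = fopen(CPUFREQ "scaling_governor", "w");
    if (f) { fputs("performance\n", f); fclose(f); }
    else   { perror("scaling_governor"); }

    show("scaling_governor");
    return 0;
}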
Article
Emerging microservices demand flexible low-latency processing of network functions in virtualized environments, e.g., as containerized network functions (CNFs). While ensuring highly responsive low-latency CNF processing, the computing environments should conserve energy to reduce costs. In this systems integration study, we develop and evaluate the novel XDP-Monitoring Energy-Adaptive Network Functions (X-MAN) framework for managing the CPU operational states (P-states) so as to reduce power consumption while prioritizing low-latency service. Architecturally, X-MAN consists of lightweight traffic monitors that are attached to the virtual network interfaces in kernel space for per-CNF traffic monitoring, and a power manager in user space with a global view of the CNFs on a CPU core. Algorithmically, X-MAN monitors the CPU core utilization via a hybrid simple and weighted moving average prediction fed by the traffic monitors, and performs power management through step-based CPU core frequency (P-state) adjustments. We evaluate X-MAN through extensive measurements in a real physical testbed operating at up to 10 Gbps. We find that X-MAN incurs significantly shorter and more consistent monitoring latencies for the CPU utilization than a state-of-the-art CPU hardware counter approach. Also, X-MAN achieves more responsive CPU core frequency adjustments and more pronounced reductions of the CPU power consumption than a state-of-the-art code instrumentation approach. We make the X-MAN source code publicly available.
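X-MAN's exact parameters are not given in this abstract; the sketch below only illustrates the general idea of a hybrid simple/weighted moving average feeding step-based frequency adjustments. The window size, thresholds, frequency table, and the 50/50 blend are assumptions, not the paper's values.

/* Hedged sketch of a step-based frequency controller in the spirit of
 * X-MAN: a hybrid of a simple and a weighted moving average over CPU
 * utilization samples drives one-step frequency-level adjustments. */
#include <stdio.h>

#define WIN 4                    /* sample window (assumption) */

static const int freqs_khz[] = { 800000, 1400000, 2000000, 2600000, 3200000 };
static const int nfreqs = sizeof freqs_khz / sizeof freqs_khz[0];

/* Simple moving average over the last WIN samples. */
static double sma(const double *u) {
    double s = 0;
    for (int i = 0; i < WIN; i++) s += u[i];
    return s / WIN;
}

/* Weighted moving average: newer samples weigh more. */
static double wma(const double *u) {
    double s = 0, w = 0;
    for (int i = 0; i < WIN; i++) { s += (i + 1) * u[i]; w += i + 1; }
    return s / w;
}

int main(void) {
    /* Synthetic utilization trace in [0,1], oldest to newest; in X-MAN
     * such values come from the in-kernel XDP traffic monitors. */
    double trace[][WIN] = {
        { 0.20, 0.30, 0.45, 0.60 },
        { 0.60, 0.75, 0.85, 0.90 },
        { 0.50, 0.40, 0.25, 0.15 },
    };
    int level = nfreqs / 2;      /* start mid-range */

    for (int t = 0; t < 3; t++) {
        double util = 0.5 * sma(trace[t]) + 0.5 * wma(trace[t]);
        if (util > 0.80 && level < nfreqs - 1) level++;  /* step up   */
        if (util < 0.30 && level > 0)          level--;  /* step down */
        printf("t=%d util=%.2f -> %d kHz\n", t, util, freqs_khz[level]);
    }
    return 0;
}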
... Performance can also be asymmetric in SMP when using dynamic voltage and frequency scaling (DVFS). However, SMP systems can boost the frequency of the lagging core [21,25,57], while AMP cannot. ...
... Computing capacity can also be asymmetric in SMP when using DVFS. Previous work [21,25,57] boosts the frequency of the lock holder to gain better throughput. However, unlike DVFS, the asymmetry in AMP is inherent. ...
Preprint
The pursuit of power-efficiency is popularizing asymmetric multicore processors (AMP) such as ARM big.LITTLE, Apple M1 and recent Intel Alder Lake with big and little cores. However, we find that existing scalable locks fail to scale on AMP and cause collapses in either throughput or latency, or both, because their implicit assumption of symmetric cores no longer holds. To address this issue, we propose the first asymmetry-aware scalable lock named LibASL. LibASL provides a new lock ordering guided by applications' latency requirements, which allows big cores to reorder with little cores for higher throughput under the condition of preserving applications' latency requirements. Using LibASL only requires linking the applications with it and, if latency-critical, inserting a few lines of code to annotate the coarse-grained latency requirement. We evaluate LibASL in various benchmarks including five popular databases on Apple M1. Evaluation results show that LibASL can improve the throughput by up to 5 times while precisely preserving the tail latency designated by applications.
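The abstract does not show LibASL's actual API; the sketch below only illustrates the kind of coarse-grained latency annotation it describes. The function name is a hypothetical placeholder, stubbed out so the sketch compiles, and is not LibASL's real interface.

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical annotation call (NOT LibASL's real API): cap the extra
 * lock-acquisition latency this thread may tolerate due to reordering
 * in favor of big cores. Stubbed out here so the sketch runs. */
static void asl_set_latency_budget_us(long us) {
    printf("latency budget: %ld us\n", us);
}

void latency_critical_worker(void) {
    asl_set_latency_budget_us(500);   /* annotate once, coarse-grained  */
    pthread_mutex_lock(&m);           /* acquired through the library,  */
    /* ... latency-critical section ... which bounds the reordering     */
    pthread_mutex_unlock(&m);
}

int main(void) { latency_critical_worker(); return 0; }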
... In Linux, the CPUFreq kernel driver [27] will choose a set of OPPs based on a specified governor. DVFS has been extensively studied [25,28,29] to accelerate multi-threaded applications. x86 manufacturers use their own DVFS implementations [30,31,32]. ...
Preprint
Full-text available
The latest ARM processors are approaching the computational power of x86 architectures while consuming much less energy. Consequently, supply follows demand with Amazon EC2, Equinix Metal and Microsoft Azure offering ARM-based instances, while Oracle Cloud Infrastructure is about to add such support. We expect this trend to continue, with an increasing number of cloud providers offering ARM-based cloud instances. ARM processors are more energy-efficient, leading to substantial electricity savings for cloud providers. However, a malicious cloud provider could intentionally reduce the CPU voltage to further lower its costs. Running applications malfunction when the undervolting goes below critical thresholds. By avoiding critical voltage regions, a cloud provider can run undervolted instances in a stealthy manner. This practical experience report describes a novel attack scenario: an attack launched by the cloud provider against its users to aggressively reduce the processor voltage for saving energy to the last penny. We call it the Scrooge Attack and show how it could be executed using ARM-based computing instances. We mimic ARM-based cloud instances by deploying our own ARM-based devices using different generations of Raspberry Pi. Using realistic and synthetic workloads, we demonstrate up to which degree of aggressiveness the attack remains practical. The attack is unnoticeable by our detection method up to an offset of -50mV. We show that the attack may even remain completely stealthy for certain workloads. Finally, we propose a set of client-based detection methods that can identify undervolted instances. We support experimental reproducibility and provide instructions to reproduce our results.
... Boosting techniques have been used to speed up the execution of serial and parallel codes by activating/deactivating the boosting frequencies according to the application characteristics [3], [5]-[7], as shown in Fig. 1. While they are usually turned on to further accelerate the execution of CPU-intensive regions, boosting frequencies are deactivated for memory-intensive regions to save energy, as there will be low CPU usage (Fig. 1b). ...
... When boosting is turned off, the P1 state is set, which represents the maximum base frequency without interference from any boosting technique. Conversely, the higher the value of Pn, the lower the frequency and voltage level [3]. AMD and Intel differ w.r.t. the combinations of frequencies and cores that can be used, as depicted in Table I for the architectures used in this work. ...
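As a hedged illustration of how entry into the boost states (P0) can be toggled from user space on Linux: which sysfs knob exists depends on the driver (intel_pstate exposes "no_turbo", acpi-cpufreq exposes a "boost" file), and writing either requires root.

/* Hedged sketch: disabling the boost frequencies (P0) from user space
 * by trying the driver-specific sysfs knobs in turn. */
#include <stdio.h>

static int write_knob(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void) {
    /* intel_pstate: "1" means turbo off; acpi-cpufreq: "0" means boost off. */
    if (write_knob("/sys/devices/system/cpu/intel_pstate/no_turbo", "1") == 0)
        puts("turbo disabled via intel_pstate");
    else if (write_knob("/sys/devices/system/cpu/cpufreq/boost", "0") == 0)
        puts("boost disabled via acpi-cpufreq");
    else
        perror("no writable boost knob found");
    return 0;
}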
... Autoturbo [16] predicts the characteristics of the application to modify the boosting at run-time. Wamhoff et al. [3] propose a model to guide software designers in using the boosting technique. Maiti et al. [6] propose a framework that improves energy and performance using the boosting feature along with thread mapping. ...
Conference Paper
With the increasing number of cores in modern systems, dynamic concurrency throttling (DCT) and turbo-boosting techniques are becoming a solution to better use the hardware resources. While DCT techniques tune the number of running threads, boosting techniques speed up sequential phases or unbalanced threads. However, as each region of an application may behave differently, optimizing both knobs is not straightforward. Hence, we propose two strategies that apply DCT and turbo-boosting: DBF, which aims to find an ideal configuration for each parallel/sequential region, and DBC, which considers the combination of parallel/sequential regions during the optimization. We show that DBF and DBC improve the EDP by up to 19% and 27% compared to a DCT-only strategy and by up to 95% and 96% compared to a Boost-only technique. We also show that DBF is more suitable for applications with high variability in the CPU workload, while DBC is better when there is low workload variability.
... Additionally, boosting techniques (e.g., Turbo Boost and Turbo CORE) also apply DVFS, but to maximize processor performance by increasing the operating frequency of a group of cores, while the remaining cores perform at a lower frequency or are turned off [1]. These technologies have been widely used to improve the performance of applications from different domains [1], [5], [6], which contain sequential and parallel phases. In sequential phases (e.g., critical sections), as only one core is being used and is on the critical path of the execution, the processor boosts the frequency of this core while lowering the operating frequency of the other cores. ...
... However, DVFS may also be applied in the opposite direction by using the so-called boosting techniques (e.g., Intel's Turbo Boost and AMD's Turbo CORE). In this mode, the operating frequency of a group of cores dynamically increases to maximize performance, while the remaining cores run at a lower operating frequency or are powered down [2], [3], respecting the limits of the thermal design power (TDP). ...
... We compare the results achieved by Poseidon to the following scenarios: Baseline: execution with the maximum number of threads available in the system and with boosting set to off; Boost-Only: the same as the baseline, but with boosting on; DCT-Boost-Off: the learning algorithm employed by Poseidon is used to find the ideal degree of TLP for each parallel region, but with boosting off; DCT-Boost-On: the same as the previous one, but with boosting always on; Exhaustive Search Solution: the execution of each application with the optimal number of threads and boosting mode, obtained by previously executing each application with all possible combinations. [Footnote 3: Available at: https://github.com/LLNL/] ...
Conference Paper
The increasing use of cloud and HPC systems puts more pressure on the efficient utilization of hardware resources to keep costs low. Many dynamic concurrency throttling (DCT) techniques have been used successfully to tune the number of executing threads and better balance a parallel application according to its available scalability. Similarly, frequency-boosting strategies have been used to speed up the execution of sequential parts. Given that, we propose Poseidon, the first transparent and automatic approach that cooperatively exploits both techniques to rebalance OpenMP applications without any preprocessing, code transformation, recompilation, or OS modification.
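Poseidon itself requires no source changes; purely for illustration, the sketch below makes explicit the knob a DCT runtime tunes, the per-region OpenMP thread count. The workload and the power-of-two search loop are assumptions for the example (compile with gcc -fopenmp).

/* Hedged sketch of dynamic concurrency throttling: run the same
 * parallel region with different thread counts and time each one. */
#include <omp.h>
#include <stdio.h>

static double work(int nthreads) {
    double sum = 0.0;
    omp_set_num_threads(nthreads);   /* throttle concurrency */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 10000000; i++)
        sum += 1.0 / i;
    return sum;
}

int main(void) {
    /* A DCT runtime would time each configuration and keep the best. */
    for (int n = 1; n <= omp_get_max_threads(); n *= 2) {
        double t0 = omp_get_wtime();
        double s = work(n);
        printf("%2d threads: %.3fs (sum=%.6f)\n",
               n, omp_get_wtime() - t0, s);
    }
    return 0;
}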
... This high frequency-switching rate is only viable because Turbo Boost is integrated into the CPU and is not controlled by the operating system. The operating system can only activate or deactivate Turbo Boost by setting the P-states [36]. If P-state P0 is set, the boost frequencies are activated. ...
Preprint
Covert channels are communication channels used by attackers to transmit information from a compromised system when the access control policy of the system does not allow doing so. Previous work has shown that CPU frequency scaling can be used as a covert channel to transmit information between otherwise isolated processes. Modern systems either try to save power or try to operate near their power limits in order to maximize performance, so they implement mechanisms to vary the frequency based on load. Existing covert channels based on this approach are either easily thwarted by software countermeasures or only work on completely idle systems. In this paper, we show how the automatic frequency scaling provided by Intel Turbo Boost can be used to construct a covert channel that is both hard to prevent without significant performance impact and can tolerate significant background system load. As Intel Turbo Boost selects the maximum CPU frequency based on the number of active cores, our covert channel modulates information onto the maximum CPU frequency by placing load on multiple additional CPU cores. Our prototype of the covert channel achieves a throughput of up to 61 bit/s on an idle system and up to 43 bit/s on a system with 25% utilization.
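The paper's calibrated parameters are not reproduced in this abstract; the receiver-side sketch below only conveys the principle of demodulating bits from the core-count-dependent Turbo Boost ceiling. The threshold and symbol period are assumptions, and the frequency is read from the standard cpufreq sysfs file.

/* Hedged sketch of a frequency covert-channel receiver: sample the
 * current core frequency and decode one bit by thresholding, assuming
 * the sender's load on additional cores lowers the boost ceiling. */
#include <stdio.h>
#include <unistd.h>

static long read_freq_khz(void) {
    long khz = -1;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (f) {
        if (fscanf(f, "%ld", &khz) != 1) khz = -1;
        fclose(f);
    }
    return khz;
}

int main(void) {
    const long threshold_khz = 3500000;  /* assumed boost cutoff        */
    for (int i = 0; i < 10; i++) {       /* sample 10 symbols           */
        long khz = read_freq_khz();
        int bit = khz >= 0 && khz < threshold_khz; /* lowered ceiling -> '1' */
        printf("f=%ld kHz -> bit %d\n", khz, bit);
        usleep(100000);                  /* 100 ms symbol period (assumed) */
    }
    return 0;
}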
... Such technologies have been widely used to improve the performance of parallel applications from different domains [2], [5]-[8], which naturally contain parallel and sequential phases. In sequential phases (e.g., critical sections), as only one core is being used and is on the critical path of execution, the processor usually boosts the frequency of this core while lowering the operating frequency of the other, waiting cores or cutting off their power supply. ...
... Esmaeilzadeh et al. [14] show that Turbo Boost is not energy efficient on the Intel Core i7. Wamhoff et al. [5] show that turbo frequencies can optimize the performance of applications that process a workload asymmetrically, by speeding up the cores with more workload. ...
... Furthermore, some works propose modifications to the boosting feature to optimize the behavior of applications. Wamhoff et al. [5] present a library to manually control boosting on AMD and Intel processors. Raghavan et al. [18] propose computational sprinting, which boosts the frequency and voltage to power levels that exceed the TDP for a few seconds in order to deliver better performance to end users. ...
Conference Paper
Technologies that improve the performance of parallel applications by increasing the nominal operating frequency of processors while respecting a given TDP (thermal design power) have been widely used. However, they may impact other non-functional requirements in different ways (e.g., increasing energy consumption or aging). Therefore, given the huge number of available configurations, spanning all possible combinations of parallel applications, number of threads, dynamic voltage and frequency scaling (DVFS) governors, boosting technologies, and simultaneous multithreading (SMT), selecting the one that offers the best trade-off for a non-functional requirement is extremely challenging for software designers. In this work, we therefore assess the impact of changing these configurations on the energy consumption, performance, and aging of parallel applications on a turbo-compliant processor. Results show that no single configuration provides the best solution for all non-functional requirements at once. For instance, we demonstrate that the configuration that offers the best performance is the same one that has the worst impact on aging, accelerating it by up to 1.75 times. With our experiments, we provide guidelines for developers when it comes to tuning performance with turbo boosting while saving as much energy as possible and increasing the lifespan of the hardware components.
... For example, consider a PM governor that dynamically adjusts the V/F based on the utilization of a physical core (e.g., the ondemand governor). Such a governor often fails to determine the V/F appropriate for consolidated VMs because it uses the overall utilization of the physical core sampled over a substantial period instead of that of individual VMs; the minimum sampling period set by Linux and Xen is 10 ms due to the various overheads caused by changing the V/F [31]. When heterogeneous VMs share a physical core, a CPU-intensive VM may run at a low V/F influenced by a less CPU-consuming VM that previously ran on the core, or vice versa. ...
... Application-level frequency control. The TURBO Diaries [31] proposes a library that enables frequency control from user space by modifying the applications' source code. Thus, it requires code-level knowledge of the applications to control frequency effectively. ...
Conference Paper
A power management policy aims to improve energy efficiency by choosing an appropriate performance (voltage/frequency) state for a given core. In current virtualized environments, multiple virtual machines (VMs) running on the same core must follow a single power management policy governed by the hypervisor. However, we observe that such a per-core power management policy has two limitations. First, it cannot offer the flexibility of choosing a desirable power management policy for each VM (or client). Second, it often hurts the power efficiency of some or even all VMs, especially when the VMs desire conflicting power management policies. To tackle these limitations, we propose VIP, a per-VM power management mechanism supporting a virtual performance state for each VM. Specifically, for VMs sharing a core, VIP allows each VM's guest OS to deploy its own desired power management policy while preventing such VMs from interfering with each other's power management policy. VIP can thus also facilitate a pricing model based on the choice of a power management policy. Furthermore, identifying some inefficiency in strictly enforcing per-VM power management policies, we propose hypervisor-assisted techniques to further improve power and energy efficiency without compromising the key benefits of per-VM power management. To demonstrate the efficacy of VIP, we consider a case where some VMs run CPU-intensive applications and other VMs run latency-sensitive applications on the same cores. Our evaluation shows that VIP reduces the overall energy consumption and improves the execution time of CPU-intensive applications compared with the default ondemand governor of the Xen hypervisor by up to 27% and 32%, respectively, without violating the service level agreement (SLA) of the latency-sensitive applications.
... This modification of the WCETs leads to an increase in the (nominal) clock frequency of core C during the execution of certain tasks. Similarly to [94], we assume a nominal frequency F_n = 2 GHz with a frequency step of 100 MHz, with: ...
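The equation this excerpt introduces is cut off in the extract. Purely as an assumption about its likely form, the selectable frequencies would constitute a discrete set stepped in 100 MHz increments above the nominal frequency:

% Hypothetical reconstruction (assumption), not the thesis's equation:
\[
  \mathcal{F} = \{\, F_n + k \cdot \Delta f \mid k = 0, 1, \dots, k_{\max} \,\},
  \qquad F_n = 2\ \mathrm{GHz}, \quad \Delta f = 100\ \mathrm{MHz}.
\]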
Thesis
The design of embedded real-time systems is expanding with the growing integration of critical functionalities into monitoring applications, notably in the biomedical, environmental, and home-automation domains. The development of these systems must address various challenges in minimizing energy consumption. Managing such fully autonomous embedded devices, however, requires solving several problems related to the amount of energy available in the battery, the real-time scheduling of tasks that must complete before their deadlines, reconfiguration scenarios, particularly when tasks are added, and the communication constraint of ensuring message exchange between processors, so as to guarantee lasting autonomy until the next recharge while maintaining an acceptable level of quality of service of the processing system. To address this problem, we propose in this work a task placement and scheduling strategy for executing real-time applications on an architecture with heterogeneous cores. In this thesis, we chose to tackle the problem incrementally, progressively handling the issues related to real-time, energy, and communication constraints. First, we focus on task scheduling on a single-core architecture. We propose a scheduling strategy based on grouping tasks into packs, so that the new task parameters can easily be computed to restore the system's feasibility. We then extend it to scheduling on a homogeneous multi-core architecture. Finally, this is extended further to reach the main objective: task scheduling for heterogeneous architectures. The idea is to progressively take increasingly complex execution constraints into account. We formalize all the problems as ILP formulations in order to produce optimal results, so that our proposed solutions can be positioned against the optimal solutions produced by a solver and against other state-of-the-art algorithms. Moreover, the validation by simulation of the proposed strategies shows that they yield appreciable gains with respect to the criteria considered important in embedded systems, notably the cost of inter-core communication and the task rejection rate.