DVFS controller block diagram.

Source publication
Article
Full-text available
A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories, and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent...

Citations

... Since 2005, the industry of semiconductor has switched to multi-core and many-core processors for efficient utilization of the enormous number of per-chip transistors. The potential of many-core processors with network-on-chip interconnects has been proven for high-performance and energy-efficient computing [4], [5]. ...
Article
Full-text available
Integrating more cores per chip enables more programs to run simultaneously, makes switching between programs easier, and can improve system performance significantly. However, this trend in central processing unit (CPU) performance cannot be maintained, since the per-chip power budget has not risen while per-core power consumption has decreased only slowly. Generally, a processor’s maximum performance is proportional to the product of the number of cores and the frequency at which they run; in practice, however, this product is limited by power constraints. In this study, first, the voltage/frequency settings of the running cores are analyzed for several programs to improve processor performance within the power constraint. Second, the impact of dynamically scaling the number of running cores is summarized for additional performance improvements of the active programs and applications. Finally, it is verified that simultaneously scaling the number of running cores and their voltage/frequency can improve processor performance, either at higher power dissipation or under power constraints. The performance analysis and improvements are obtained in real-time simulation on a Linux operating system using the gem5 simulator. Results indicate that performance improvements of 59.98%, 33.33%, and 66.65% were attained for the three scenarios, respectively.
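The trade-off this abstract describes, performance scaling with cores × frequency while dynamic power grows superlinearly with frequency, can be sketched numerically. This is a minimal sketch: the power model (P = n · C · V² · f, with V rising roughly linearly with f) is a textbook approximation, and every constant below is an illustrative assumption, not a value from the cited study.

```python
# Sketch: choose (cores, frequency) to maximize throughput under a power cap.
# The dynamic-power model P = n * C * V^2 * f, with V rising linearly with f,
# is a textbook approximation; all constants here are illustrative.

def power_w(cores, freq_ghz, c_eff=1.0, v0=0.6, k=0.5):
    """Dynamic power of `cores` cores at `freq_ghz`, with V = v0 + k * f."""
    v = v0 + k * freq_ghz
    return cores * c_eff * v * v * freq_ghz

def best_config(power_budget_w, max_cores=16, freqs=(0.5, 1.0, 1.5, 2.0)):
    """Exhaustively pick the (cores, freq) pair with the highest
    cores * freq product that still fits inside the power budget."""
    best = None
    for n in range(1, max_cores + 1):
        for f in freqs:
            if power_w(n, f) <= power_budget_w:
                perf = n * f
                if best is None or perf > best[0]:
                    best = (perf, n, f)
    return best

if __name__ == "__main__":
    perf, n, f = best_config(power_budget_w=20.0)
    print(f"best: {n} cores @ {f} GHz -> relative performance {perf}")
```

Under this toy model the optimum shifts toward more, slower cores as the budget tightens, which is the qualitative behavior the study exploits.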
... The 167-processor architecture [127] was architecturally similar to both a CGRA and a conventional multi-core processor, and we include it here since the processing elements are simple and communication between them is performed only using direct (yet dynamically configured) connectivity (not through shared memory or cache coherence as in multi-cores). The main focus of this work is to reduce power consumption through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [128], etc.). ...
Article
Full-text available
With the end of both Dennard’s scaling and Moore’s law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... The 167-processor architecture [121] sits on the border between CGRAs and conventional multi-core processors, but we include it here since the processing elements are simple and communication between them is performed only using direct (yet dynamically configured) connectivity (not through shared memory or cache coherence as in multi-cores). The main focus of this work is low power, achieved through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [122], etc.). ...
Preprint
With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... A typical NoC architecture for a system-on-a-chip is described in [1]. NoCs [2][3][4][5] provide a communication link between multiple modules such as central processing and graphics processing units, field-programmable gate arrays, memory devices, peripheral intellectual properties (IPs), and digital signal processors [1]. ...
Article
Full-text available
A simultaneous and reconfigurable multi-level RF-interconnect (MRI) for global network-on-chip (NoC) communication is demonstrated. The proposed MRI interface consists of baseband (BB) and RF-band transceivers. The BB transceiver uses multi-level signaling (MLS) to enhance communication bandwidth. The RF-band transceiver utilizes amplitude-shift keying (ASK) modulation to support simultaneous communication on a shared single-ended on-chip global interconnect. A phase-locked loop (PLL) using a sub-harmonic multiply-by-10 injection-locked frequency multiplier (ILFM) is also designed to support a fully-synchronous NoC architecture. A differential voltage-controlled oscillator (VCO) in the PLL generates output frequencies in the 0.5–2.65 GHz range, and the multiply-by-10 ILFM produces a frequency 10 times higher than the VCO output. Using the proposed multiply-by-10 ILFM minimizes the number of power-hungry frequency-divider stages in the PLL feedback loop, improving the MRI power efficiency. The MLS-based BB and ASK-based RF bands carry 10 Gb/s/pin and 4.4 Gb/s/pin, respectively. The proposed system is fabricated in a 65 nm CMOS process and achieves an energy efficiency of 2 pJ/b.
... With fine-grain DVFS, each core can be set to a different operating point in the voltage/frequency space; this makes it possible to run multiple tasks asynchronously and brings down the minimum-energy point of the whole SoC. An example of fine-grain DVFS on massively parallel platforms is given in [153], where 167 processors are orchestrated over a wide frequency range to achieve minimum power consumption, from 1.07 GHz at 47.5 mW (1.2 V) down to 66 MHz at 608 µW (0.675 V). As an add-on feature, fine-grain DVFS is a perfect knob to compensate for and/or mitigate Process, Voltage and Temperature (PVT) variations. ...
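The per-core operating-point selection described in this excerpt can be sketched as picking the lowest energy-per-cycle point that still meets a core's required throughput. The two operating points below are the endpoints quoted for the 167-processor array (restoring the evidently dropped µ prefix, i.e. 608 µW); the selection policy itself is an illustrative assumption.

```python
# Sketch: pick the minimum-energy DVFS operating point that still meets a
# per-core throughput requirement. Energy per clock cycle = power / frequency.
# The two points are the endpoints quoted for the 167-processor array;
# the selection policy is an illustrative assumption.

OPERATING_POINTS = [
    # (frequency_hz, power_w, supply_v)
    (1.07e9, 47.5e-3, 1.200),
    (66e6,   608e-6,  0.675),
]

def energy_per_cycle(freq_hz, power_w):
    """Joules spent per clock cycle at this operating point."""
    return power_w / freq_hz

def pick_point(required_hz):
    """Lowest energy-per-cycle point with freq >= required_hz, or None."""
    feasible = [p for p in OPERATING_POINTS if p[0] >= required_hz]
    if not feasible:
        return None
    return min(feasible, key=lambda p: energy_per_cycle(p[0], p[1]))
```

For light workloads the 66 MHz / 0.675 V point wins (about 9.2 pJ/cycle versus about 44 pJ/cycle at 1.07 GHz), which is exactly why per-core scaling lowers the SoC's minimum-energy point.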
... Unfortunately, the use of integrated DC/DC converters is made impractical by high implementation costs. Indeed, on-chip DC/DC converters fabricated with today's technologies may occupy considerable silicon area due to the low integration density of the components they contain, e.g., capacitors and inductors [153], [131]. The picture gets even more complicated considering that every single core would need a dedicated converter. ...
Thesis
Full-text available
Energy efficiency has become the main constraint for most of today’s information and communication technologies, from those involving high-performance computing (e.g., cloud services) to those deployed in low-power applications (e.g., portable systems for the Internet-of-Things). In past decades, the pursuit of energy efficiency was supported mainly by advances in the underlying CMOS technology: moving to a new node practically guaranteed energy savings of more than 90%. However, once CMOS entered the nanometric regime, the improvements brought by a technology shift shrank substantially, to 20% and then even less, generation by generation. To make matters worse, production costs rose dramatically due to the technological impediments imposed by physical geometries below the 28 nm mark, making technology scaling impractical for many cost-sensitive applications. Sophisticated energy-aware design practices were then introduced to alleviate the effects of slow technology scaling, and low-power and energy-management techniques soon became the kernel of any design and optimization flow. Unfortunately, design techniques are not fully renewable either: their effectiveness degrades as technology nodes advance. This is the case for voltage scaling, for instance, which has hit the 1.0 V plateau that still holds today, but also for architectural-level techniques such as multi-core/many-core solutions, which have been seriously limited by stringent dark-silicon constraints. The end of Moore’s law is not just a technology issue; it is also the prelude to a design crisis that will soon require rethinking the optimization and integration strategy of digital circuits and systems. A radical solution to all these concerns has yet to come. However, the recent growth of data-centric applications is opening new design paradigms that alleviate the pressure.
Much room remains at the application level, where alternative energy-management knobs are available. The basic idea is to integrate the quality-of-results as a new dimension in the design space. Leveraging the intrinsic error resilience of data-centric applications, it is thereby possible to implement Energy-Accuracy Scaling (EAS), which is orthogonal to the technology adopted and the low-power design strategy deployed. At the basis of this concept lies the simple intuition that an application whose output can be degraded without affecting the quality perceived by the user may require lower energy consumption for the same amount of work. The broad objective of this dissertation is to introduce advanced design solutions that improve the way the EAS paradigm is implemented. Two new strategies are presented that reduce the design overhead of classical approximate solutions; according to the revisited taxonomy introduced in this thesis, one of the proposed strategies belongs to the class of Adaptive EAS, while the second falls under the label of Static EAS. With Adaptive EAS, the optimal energy-accuracy tradeoff is achieved by measuring quality metrics directly on-chip, at run-time, establishing a feedback loop that drives the energy minimization. These metrics can be obtained by explicitly measuring the output accuracy, or by indirect measurements, e.g., through the output error rate. With Static EAS, the energy-accuracy tradeoff is fixed at design time by functional speculation, i.e., a modification of the logic functionality through algorithmic or circuit simplifications that induce energy savings for a worst-case accuracy loss. The Adaptive solution extends conventional Error Detection-Correction techniques to data-driven voltage scaling in order to trade system accuracy for energy reduction.
The new mechanism, called Approximate Error Detection-Correction (AED-C), is built upon in-situ elastic timing monitors that allow a lightweight error-management scheme. AED-C implements EAS using the error-detection coverage as a knob: low error coverage accelerates supply-voltage over-scaling to achieve more significant energy savings at the cost of quality-of-result; high error coverage lessens the voltage scaling, leading to higher accuracy at the cost of lower energy savings. Since EAS does not have to ensure full error coverage, the traditionally large area/energy overhead of conventional techniques is drastically reduced. Simulations over a representative set of applications/circuits, e.g., a Multiply-Accumulate (MAC) unit, a Discrete Cosine Transform (DCT), and FIR and IIR filters, provide a comparative analysis against state-of-the-art techniques. The collected results show that AED-C substantially reduces the average energy-per-operation and the area overhead while still guaranteeing reasonable accuracy. The Static EAS strategy, instead, is developed by exploiting machine-learning theories that suggest alternative forms of representing relationships among data. Such theories find application in the Boolean domain, where logic functions can be described as inference rules. The novel paradigm, named Inferential Logic, leverages the concept of statistical inference for the design of combinational logic circuits that mimic Boolean functions to a certain degree of accuracy. These inferential logic circuits run quasi-exact computations, trading energy efficiency for accuracy in error-resilient applications. The figures-of-merit of an Inferential Multiplier are quantified using representative image-processing applications as a case study. A comparative analysis against a state-of-the-art Booth multiplier shows that the inferential logic representation simplifies the circuit, reducing overall area/delay. As a result, the inferential multiplier can exploit the latency reduction for power optimization while guaranteeing a fixed average accuracy.
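The coverage-as-a-knob idea in this abstract can be sketched as a feedback loop: lower error coverage tolerates a higher observed error rate, so the controller over-scales the supply further. Everything below, the exponential timing-error model, the coverage-to-threshold mapping, and all constants, is an illustrative mock-up, not the thesis's implementation.

```python
# Sketch of the AED-C error-coverage knob: a feedback loop that lowers VDD
# while a (mock) observed error rate stays under a coverage-dependent
# threshold. The error model and all constants are illustrative assumptions.
import math

def mock_error_rate(vdd):
    """Toy timing-error model: errors grow sharply as VDD drops below 1.0 V."""
    return math.exp(-25.0 * (vdd - 0.6)) if vdd < 1.0 else 0.0

def settle_vdd(coverage, v_nom=1.0, v_min=0.6, step=0.01):
    """Scale VDD down until the tolerated error rate would be exceeded.
    High coverage -> low tolerated error rate -> less voltage over-scaling."""
    tolerated = 0.05 * (1.0 - coverage)  # illustrative coverage mapping
    vdd = v_nom
    while vdd - step >= v_min and mock_error_rate(vdd - step) <= tolerated:
        vdd -= step
    return round(vdd, 3)
```

Under this mock model, a low-coverage setting settles at a lower supply voltage (more energy saved, lower accuracy) than a high-coverage one, reproducing the trade-off the abstract describes.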
... Moreover, energy efficiency has been a primary design objective in embedded systems [31]. These multi/many-core architectures (expected to reach 1000+ cores on a single chip [32]) have been demonstrated as a promising solution for energy-efficient computing [33,34]. ...
... Energy-efficiency and high-performance requirements have been a key research focus of processor designers for multi-core platforms [34]. The MediaTek Helio X20 [30] and Juno r2 [29] contain such an architecture. ...
... Efficient runtime management of multi-threaded applications on heterogeneous multi-cores is of paramount importance to achieving the energy-efficiency and performance requirements that have been a key research focus for computing systems [19,34,166]. ...
Thesis
Full-text available
Multi-core platforms are employing a greater number of heterogeneous cores and resource configurations to achieve energy efficiency and high performance. These platforms often execute applications with different performance constraints concurrently; the applications contend for resources simultaneously, generating varying workload and resource demands over time. There is little reported work on runtime energy management of concurrent execution, and it focuses mostly on homogeneous multi-cores and limited application scenarios. This thesis considers both homogeneous and heterogeneous multi-cores and broadens the application scenarios. The following contributions are made. Firstly, the thesis presents online Dynamic Voltage and Frequency Scaling (DVFS) techniques for concurrent execution of single-threaded and multi-threaded applications on homogeneous multi-cores. This includes an experimental analysis and the derivation of metrics for efficient online workload classification. The DVFS level is proactively set through predicted workload, measured through Memory Reads Per Instruction. The analysis also considers thread synchronisation overheads and the underlying memory and DVFS architectures. Average energy savings of up to 60% are observed when evaluated on three different hardware platforms (Odroid-XU3, Intel Xeon E5-2630, and Xeon Phi 7620P). Next, an energy-efficient static mapping and DVFS approach is proposed for heterogeneous multi-core CPUs. This approach simultaneously exploits different types of cores for each application in a concurrent execution scenario. It first uses offline results to select, for each application, the performance-meeting mapping (number and type of cores) with minimum energy consumption. Online DVFS is then applied to adapt to workload and performance variations. Compared to recent techniques, the proposed approach has on average 33% lower energy consumption when validated on the Odroid-XU3.
To eliminate the dependency on offline application profiling and to adapt to dynamic application arrival/completion, an adaptive mapping approach coupled with DVFS is presented. This is achieved through an accurate performance model, an energy-efficient resource selection technique, and a resource manager. Experimental evaluation on the Odroid-XU3 shows an improvement of up to 28% in energy efficiency and 7.9% better prediction accuracy from the performance models.
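The proactive DVFS scheme this thesis describes, setting the frequency from predicted Memory Reads Per Instruction (MRPI), can be sketched as a small governor. The predictor (simple exponential smoothing), the DVFS levels, and the MRPI thresholds below are all illustrative assumptions, not the thesis's tuned values.

```python
# Sketch: proactive DVFS using Memory Reads Per Instruction (MRPI) as the
# workload signal. High MRPI indicates memory-bound execution, where lowering
# the clock saves energy with little slowdown. The predictor, frequency
# levels, and thresholds are illustrative assumptions.

FREQ_LEVELS_MHZ = [600, 1000, 1400, 1800]  # hypothetical DVFS levels

def predict_mrpi(history, alpha=0.5):
    """Exponentially smoothed prediction of the next interval's MRPI."""
    pred = history[0]
    for sample in history[1:]:
        pred = alpha * sample + (1 - alpha) * pred
    return pred

def select_freq(predicted_mrpi):
    """Map predicted MRPI to a DVFS level: more memory-bound -> lower clock."""
    if predicted_mrpi > 0.10:   # heavily memory-bound
        return FREQ_LEVELS_MHZ[0]
    if predicted_mrpi > 0.05:
        return FREQ_LEVELS_MHZ[1]
    if predicted_mrpi > 0.02:
        return FREQ_LEVELS_MHZ[2]
    return FREQ_LEVELS_MHZ[3]   # compute-bound: run at top speed
```

A compute-bound phase (low MRPI) keeps the top frequency, while a memory-bound phase drops to a lower level, which is the energy-saving behavior the thesis exploits.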
... The University of California, Davis developed AsAP (Asynchronous Array of Simple Processors) [42]. The objective of the project is to perform signal-processing computations on small processors within limited power budgets. ...
Thesis
Software-defined radio (SDR) is a promising technology for tackling the flexibility requirements of new generations of communication standards. It can easily be reprogrammed at the software level to implement different waveforms. When relying on a software-based technology such as microprocessors, this approach is clearly flexible and quite easy to design, but it usually provides low computing capability and therefore low throughput. To tackle this issue, FPGA technology turns out to be a good alternative for implementing SDRs: FPGAs have both high computing power and reconfiguration capacity. Thus, including FPGAs in the SDR concept may make it possible to support more waveforms, with stricter requirements, than a processor-based approach. However, the main drawbacks of FPGA design are the low level of the input description language, which is essentially the hardware level, and the reconfiguration time, which may exceed run-time requirements if the complete FPGA is reconfigured. To overcome these issues, this PhD thesis proposes a design methodology that leverages both high-level synthesis tools and dynamic reconfiguration. The proposed methodology is a guideline for completely building a flexible radio for FPGA-based SDR that can be reconfigured at run-time.
... Continuous development of modern systems-on-chip (SoC) has led to the emergence of multiprocessor systems. For example, Intel has developed two experimental processors with 48 and 80 cores (Howard et al., 2011), and a processor with 167 cores has been developed (Truong et al., 2009); the ZMS-40 100-core StemCell Media processor with quad ARM Cortex-A9 cores (Huangfu & Zhang, 2015; "ZiiLABS unveils 100-Core ZMS-40 processor", 2012), the TILE-Gx72 processor with 72 C-programmable 64-bit RISC cores, and the networking-targeted TILE-Mx100 processor with 100 64-bit ARM Cortex-A53 cores ("Mellanox Products: TILE-Gx72 Processor", 2016) are commercially available. Other companies are also pursuing ongoing developments. ...
Article
Full-text available
This article considers current trends in network-on-chip (NoC) research and known approaches to NoC modeling. The characteristics of analytic and high-/low-level simulation are given. The SystemC programming language is proposed as an alternative solution for creating NoC models, and methods for increasing the speed of SystemC models are examined and formulated. It is shown how SystemC can reduce the disadvantages and maximize the advantages of the high-level and low-level approaches. To this end, results of high-level, low-level, and SystemC NoC simulation are compared using the example of “hot spots” and the effect of the geometric shape of regular NoC topologies on their performance.
... Finally, voltage scalability is important for supporting dynamic voltage frequency scaling (DVFS) systems [24,25]. DVFS systems can provide peak performance when workload is heavy by operating a processor at nominal supply voltage (V DD ). ...
Article
Full-text available
This paper presents an on-chip temperature sensor circuit for dynamic thermal management in VLSI systems. The sensor directly senses the threshold voltage, which contains temperature information, using a single PMOS device. This simple structure enables the sensor to achieve an ultra-compact footprint. The sensor also exhibits high accuracy and voltage scalability down to 0.4 V, allowing it to be used in dynamic voltage and frequency scaling systems without requiring extra power distribution or regulation. The compact footprint and voltage scalability enable the proposed sensor to be implemented in a digital standard-cell format, allowing aggressive sensor placement very close to target hotspots in digital blocks. The proposed sensor frontend, prototyped in a 65 nm CMOS technology, has a footprint of 30.1 µm² and a 3σ error of ±1.1 °C across 0 to 100 °C after one-point temperature calibration, marking a significant improvement over existing sensors designed for dynamic thermal management in VLSI systems.
... Mesh is one of the most extensively investigated topologies for the communication architecture of multiprocessor systems-on-chip (MPSoC). In particular, the mesh-connected VLSI processor array, exemplified by RAW [1], AsAP [2], TILE64 [3], and AsAP2 [4], is a type of massively parallel system well suited to high-speed implementation of most signal and image processing algorithms. However, as VLSI array density increases, the probability of a malfunction occurring in the array also increases. ...
... Given an m × n host array, the abstraction proceeds from row R1 to row Rm. First, R1 is confirmed as an abstract row. (Fig. 4 of the cited work shows an example of a communication collision caused by R1.) ...
Article
Employment of fault-tolerant techniques is necessary for the operability and reliability of very-large-scale integration (VLSI) arrays. Moreover, fault tolerance must be achieved at high speed in real-time embedded systems. In this paper, a novel abstract model based on an abstraction technique is proposed to accelerate the reconfiguration of degradable VLSI arrays with low fault density. Given an m × n physical array with faults, the proposed technique extracts the physical rows containing faulty elements to construct an abstract array, ignoring the physical rows without faults, so that the size of the original physical array is significantly reduced and the reconfiguration of the VLSI array is sped up. In addition, we demonstrate the preservation of properties between the abstract array and the physical array, guaranteeing that the proposed technique is correct. Experimental results show that for a 128 × 128 physical array with 0.1% faults, a 15 × 128 abstract array can be constructed, and the compression is more than 88.20%. This abstraction technique is bound into a reconfiguration algorithm cited in this paper. Simulation results show that the running time can be improved by more than 66.79% for a 128 × 128 physical array with 0.1% faults. © 2017 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
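The row-abstraction idea in this abstract, keep only the physical rows that contain a fault and reconfigure the much smaller abstract array, can be sketched in a few lines. This is an illustrative re-creation of the technique as described, not the authors' algorithm.

```python
# Sketch of row abstraction: keep only the physical rows that contain at
# least one fault, and measure the resulting compression. Illustrative
# re-creation of the technique described in the abstract.

def abstract_rows(m, n, faults):
    """Rows (0-based) of an m x n array containing at least one fault.
    `faults` is an iterable of (row, col) positions."""
    return sorted({r for r, c in faults if 0 <= r < m and 0 <= c < n})

def compression(m, kept_rows):
    """Fraction of physical rows dropped by the abstraction."""
    return (m - len(kept_rows)) / m

# A 128 x 128 array with 0.1% faults has roughly 16 faulty cells; if they
# fall in 15 distinct rows, the abstract array is 15 x 128 and the
# compression matches the ~88.2% figure reported above.
```

For example, 16 faults spread over 15 rows of a 128 × 128 array give a 15 × 128 abstract array and a compression of 113/128 ≈ 88.3%, consistent with the paper's reported figure.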