DVFS controller block diagram.

Source publication
Article
Full-text available
A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories, and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent...

Citations

... Since 2005, the industry of semiconductor has switched to multi-core and many-core processors for efficient utilization of the enormous number of per-chip transistors. The potential of many-core processors with network-on-chip interconnects has been proven for high-performance and energy-efficient computing [4], [5]. ...
Article
Full-text available
Integrating more cores per chip enables more programs to run simultaneously, makes switching between programs easier, and can improve system performance significantly. However, this trend in central processing unit (CPU) performance cannot be maintained, since the per-chip power budget has not risen while per-core power consumption has decreased only slowly. Generally, a processor’s maximum performance is proportional to the product of the number of cores and the frequency at which they run; in practice, however, this product is limited by power constraints. In this study, first, the voltage/frequency settings of the running cores are analyzed for several programs to improve processor performance within the power constraint. Second, the impact of dynamically scaling the number of running cores is summarized for additional performance improvements of the active programs and applications. Finally, it is verified that simultaneously scaling the number of running cores and their voltage/frequency can improve processor performance, either at higher power dissipation or under power constraints. The performance analysis and improvements are obtained in real-time simulation on a Linux operating system using the gem5 simulator. Results indicate that performance improvements of 59.98%, 33.33%, and 66.65% were attained for the three scenarios, respectively.
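The trade-off this abstract describes, performance scaling with cores × frequency while dynamic power grows superlinearly with frequency, can be sketched numerically. This is a minimal sketch: the power model (P = n · C · V² · f, with V rising roughly linearly with f) is a textbook approximation, and every constant below is an illustrative assumption, not a value from the cited study.

```python
# Sketch: choose (cores, frequency) to maximize throughput under a power cap.
# The dynamic-power model P = n * C * V^2 * f, with V rising linearly with f,
# is a textbook approximation; all constants here are illustrative.

def power_w(cores, freq_ghz, c_eff=1.0, v0=0.6, k=0.5):
    """Dynamic power of `cores` cores at `freq_ghz`, with V = v0 + k * f."""
    v = v0 + k * freq_ghz
    return cores * c_eff * v * v * freq_ghz

def best_config(power_budget_w, max_cores=16, freqs=(0.5, 1.0, 1.5, 2.0)):
    """Exhaustively pick the (cores, freq) pair with the highest
    cores * freq product that still fits inside the power budget."""
    best = None
    for n in range(1, max_cores + 1):
        for f in freqs:
            if power_w(n, f) <= power_budget_w:
                perf = n * f
                if best is None or perf > best[0]:
                    best = (perf, n, f)
    return best

if __name__ == "__main__":
    perf, n, f = best_config(power_budget_w=20.0)
    print(f"best: {n} cores @ {f} GHz -> relative performance {perf}")
```

Under this toy model the optimum shifts toward more, slower cores as the budget tightens, which is the qualitative behavior the study exploits.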
... The 167-processor architecture [127] was architecturally similar to both a CGRA and a conventional multi-core processor, and we include it here since the processing elements are simple and communication between them is performed only using direct (yet dynamically configured) connectivity (not through shared memory or cache coherence as in multi-cores). The main focus of this work is to reduce power consumption through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [128], etc.). ...
Article
Full-text available
With the end of both Dennard’s scaling and Moore’s law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... The 167-processor architecture [121] sits on the border between CGRAs and conventional multi-core processors, but we include it here since the processing elements are simple and communication between them is performed only using direct (yet dynamically configured) connectivity (not through shared memory or cache coherence as in multi-cores). The main focus of this work is low power, achieved through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [122], etc.). ...
Preprint
With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of compute in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with particular focus on premises behind the different CGRA architectures and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover existing knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
... A typical NoC architecture for a system-on-a-chip is described in [1]. NoCs [2][3][4][5] provide a communication link between multiple modules such as central processing and graphics processing units, field-programmable gate arrays, memory devices, peripheral intellectual properties (IPs), and digital signal processors [1]. ...
Article
Full-text available
A simultaneous and reconfigurable multi-level RF-interconnect (MRI) for global network-on-chip (NoC) communication is demonstrated. The proposed MRI interface consists of baseband (BB) and RF-band transceivers. The BB transceiver uses multi-level signaling (MLS) to enhance communication bandwidth. The RF-band transceiver utilizes amplitude-shift keying (ASK) modulation to support simultaneous communication on a shared single-ended on-chip global interconnect. A phase-locked loop (PLL) using a sub-harmonic multiply-by-10 injection-locked frequency multiplier (ILFM) is also designed to support a fully-synchronous NoC architecture. A differential voltage-controlled oscillator (VCO) in the PLL generates output frequencies in the 0.5–2.65 GHz range, and the multiply-by-10 ILFM produces a frequency 10 times higher than the VCO output. Using the proposed multiply-by-10 ILFM minimizes the number of power-hungry frequency-divider stages in the PLL feedback loop, improving the MRI power efficiency. The MLS-based BB and ASK-based RF bands carry 10 Gb/s/pin and 4.4 Gb/s/pin, respectively. The proposed system is fabricated in a 65 nm CMOS process and achieves an energy efficiency of 2 pJ/b.
... With fine-grain DVFS, each core can be set to a different operating point in the voltage/frequency space; this makes it possible to run multiple tasks asynchronously and brings down the minimum-energy point of the whole SoC. An example of fine-grain DVFS on massively parallel platforms is given in [153], where 167 processors are orchestrated over a wide frequency range to achieve minimum power consumption, from 1.07 GHz at 47.5 mW (1.2 V) down to 66 MHz at 608 µW (0.675 V). As an add-on feature, fine-grain DVFS is a perfect knob to compensate for and/or mitigate Process, Voltage and Temperature (PVT) variations. ...
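The per-core operating-point selection described in this excerpt can be sketched as picking the lowest energy-per-cycle point that still meets a core's required throughput. The two operating points below are the endpoints quoted for the 167-processor array (restoring the evidently dropped µ prefix, i.e. 608 µW); the selection policy itself is an illustrative assumption.

```python
# Sketch: pick the minimum-energy DVFS operating point that still meets a
# per-core throughput requirement. Energy per clock cycle = power / frequency.
# The two points are the endpoints quoted for the 167-processor array;
# the selection policy is an illustrative assumption.

OPERATING_POINTS = [
    # (frequency_hz, power_w, supply_v)
    (1.07e9, 47.5e-3, 1.200),
    (66e6,   608e-6,  0.675),
]

def energy_per_cycle(freq_hz, power_w):
    """Joules spent per clock cycle at this operating point."""
    return power_w / freq_hz

def pick_point(required_hz):
    """Lowest energy-per-cycle point with freq >= required_hz, or None."""
    feasible = [p for p in OPERATING_POINTS if p[0] >= required_hz]
    if not feasible:
        return None
    return min(feasible, key=lambda p: energy_per_cycle(p[0], p[1]))
```

For light workloads the 66 MHz / 0.675 V point wins (about 9.2 pJ/cycle versus about 44 pJ/cycle at 1.07 GHz), which is exactly why per-core scaling lowers the SoC's minimum-energy point.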
... Unfortunately, the use of integrated DC/DC converters is made impractical by high implementation costs. Indeed, on-chip DC/DC converters fabricated with today's technologies may occupy considerable silicon area due to the low integration density of the components they contain, e.g., capacitors and inductors [153], [131]. The picture gets even more complicated considering that every single core would need a dedicated converter. ...
Thesis
Full-text available
Energy efficiency has become the main constraint for most of today’s information and communication technologies, from those involving high-performance computing (e.g., cloud services) to those deployed in low-power applications (e.g., portable systems for the Internet-of-Things). In past decades, the pursuit of energy efficiency was supported mainly by advances in the underlying CMOS technology: moving to a new node practically guaranteed energy savings of more than 90%. However, once CMOS entered the nanometric regime, the improvements brought by a technology shift shrank substantially, to 20% and then even less, generation by generation. To make matters worse, production costs rose dramatically due to the technological impediments imposed by physical geometries below the 28 nm mark, making technology scaling impractical for many cost-sensitive applications. Sophisticated energy-aware design practices were then introduced to alleviate the effects of slow technology scaling, and low-power and energy-management techniques soon became the kernel of any design and optimization flow. Unfortunately, design techniques are not fully renewable either: their effectiveness degrades as technology nodes advance. This is the case for voltage scaling, for instance, which has hit the 1.0 V plateau that still holds today, but also for architectural-level techniques such as multi-core/many-core solutions, which have been seriously limited by stringent dark-silicon constraints. The end of Moore’s law is not just a technology issue; it is also the prelude to a design crisis that will soon require rethinking the optimization and integration strategy of digital circuits and systems. A radical solution to all these concerns has yet to come. However, the recent growth of data-centric applications is opening new design paradigms that alleviate the pressure.
Much room remains at the application level, where alternative energy-management knobs are available. The basic idea is to integrate the quality-of-results as a new dimension in the design space. Leveraging the intrinsic error resilience of data-centric applications, it is thereby possible to implement Energy-Accuracy Scaling (EAS), which is orthogonal to the technology adopted and the low-power design strategy deployed. At the basis of this concept lies the simple intuition that an application whose output can be degraded without affecting the quality perceived by the user may require lower energy consumption for the same amount of work. The broad objective of this dissertation is to introduce advanced design solutions that improve the way the EAS paradigm is implemented. Two new strategies are presented that reduce the design overhead of classical approximate solutions; according to the revisited taxonomy introduced in this thesis, one of the proposed strategies belongs to the class of Adaptive EAS, while the second falls under the label of Static EAS. With Adaptive EAS, the optimal energy-accuracy tradeoff is achieved by measuring quality metrics directly on-chip, at run-time, establishing a feedback loop that drives the energy minimization. These metrics can be obtained by explicitly measuring the output accuracy, or by indirect measurements, e.g., through the output error rate. With Static EAS, the energy-accuracy tradeoff is fixed at design time by functional speculation, i.e., a modification of the logic functionality through algorithmic or circuit simplifications that induce energy savings for a worst-case accuracy loss. The Adaptive solution extends conventional Error Detection-Correction techniques to data-driven voltage scaling in order to trade system accuracy for energy reduction.
The new mechanism, called Approximate Error Detection-Correction (AED-C), is built upon in-situ elastic timing monitors that allow a lightweight error-management scheme. AED-C implements EAS using the error-detection coverage as a knob: low error coverage accelerates supply-voltage over-scaling to achieve more significant energy savings at the cost of quality-of-result; high error coverage lessens the voltage scaling, leading to higher accuracy at the cost of lower energy savings. Since EAS does not have to ensure full error coverage, the traditionally large area/energy overhead of conventional techniques is drastically reduced. Simulations over a representative set of applications/circuits, e.g., a Multiply-Accumulate (MAC) unit, a Discrete Cosine Transform (DCT), and FIR and IIR filters, provide a comparative analysis against state-of-the-art techniques. The collected results show that AED-C substantially reduces the average energy-per-operation and the area overhead while still guaranteeing reasonable accuracy. The Static EAS strategy, instead, is developed by exploiting machine-learning theories that suggest alternative forms of representing relationships among data. Such theories find application in the Boolean domain, where logic functions can be described as inference rules. The novel paradigm, named Inferential Logic, leverages the concept of statistical inference for the design of combinational logic circuits that mimic Boolean functions to a certain degree of accuracy. These inferential logic circuits run quasi-exact computations, trading energy efficiency for accuracy in error-resilient applications. The figures-of-merit of an Inferential Multiplier are quantified using representative image-processing applications as a case study. A comparative analysis against a state-of-the-art Booth multiplier shows that the inferential logic representation simplifies the circuit, reducing overall area/delay. As a result, the inferential multiplier can exploit the latency reduction for power optimization while guaranteeing a fixed average accuracy.
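The coverage-as-a-knob idea in this abstract can be sketched as a feedback loop: lower error coverage tolerates a higher observed error rate, so the controller over-scales the supply further. Everything below, the exponential timing-error model, the coverage-to-threshold mapping, and all constants, is an illustrative mock-up, not the thesis's implementation.

```python
# Sketch of the AED-C error-coverage knob: a feedback loop that lowers VDD
# while a (mock) observed error rate stays under a coverage-dependent
# threshold. The error model and all constants are illustrative assumptions.
import math

def mock_error_rate(vdd):
    """Toy timing-error model: errors grow sharply as VDD drops below 1.0 V."""
    return math.exp(-25.0 * (vdd - 0.6)) if vdd < 1.0 else 0.0

def settle_vdd(coverage, v_nom=1.0, v_min=0.6, step=0.01):
    """Scale VDD down until the tolerated error rate would be exceeded.
    High coverage -> low tolerated error rate -> less voltage over-scaling."""
    tolerated = 0.05 * (1.0 - coverage)  # illustrative coverage mapping
    vdd = v_nom
    while vdd - step >= v_min and mock_error_rate(vdd - step) <= tolerated:
        vdd -= step
    return round(vdd, 3)
```

Under this mock model, a low-coverage setting settles at a lower supply voltage (more energy saved, lower accuracy) than a high-coverage one, reproducing the trade-off the abstract describes.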
... Moreover, energy efficiency has been a primary design objective in embedded systems [31]. These multi/many-core architectures (expected to reach 1000+ cores on a single chip [32]) have been demonstrated as a promising solution for energy-efficient computing [33,34]. ...
... Energy-efficiency and high-performance requirements have been a key research focus of processor designers for multi-core platforms [34]. The MediaTek Helio X20 [30] and Juno r2 [29] contain such an architecture. ...
... Efficient runtime management of multi-threaded applications on heterogeneous multi-cores is of paramount importance to achieving the energy-efficiency and performance requirements that have been a key research focus for computing systems [19,34,166]. ...
Thesis
Full-text available
Multi-core platforms are employing a greater number of heterogeneous cores and resource configurations to achieve energy efficiency and high performance. These platforms often execute applications with different performance constraints concurrently; the applications contend for resources simultaneously, generating varying workload and resource demands over time. There is little reported work on runtime energy management of concurrent execution, and it focuses mostly on homogeneous multi-cores and limited application scenarios. This thesis considers both homogeneous and heterogeneous multi-cores and broadens the application scenarios. The following contributions are made. Firstly, the thesis presents online Dynamic Voltage and Frequency Scaling (DVFS) techniques for concurrent execution of single-threaded and multi-threaded applications on homogeneous multi-cores. This includes an experimental analysis and the derivation of metrics for efficient online workload classification. The DVFS level is proactively set through predicted workload, measured through Memory Reads Per Instruction. The analysis also considers thread synchronisation overheads and the underlying memory and DVFS architectures. Average energy savings of up to 60% are observed when evaluated on three different hardware platforms (Odroid-XU3, Intel Xeon E5-2630, and Xeon Phi 7620P). Next, an energy-efficient static mapping and DVFS approach is proposed for heterogeneous multi-core CPUs. This approach simultaneously exploits different types of cores for each application in a concurrent execution scenario. It first uses offline results to select, for each application, the performance-meeting mapping (number and type of cores) with minimum energy consumption. Online DVFS is then applied to adapt to workload and performance variations. Compared to recent techniques, the proposed approach has on average 33% lower energy consumption when validated on the Odroid-XU3.
To eliminate the dependency on offline application profiling and to adapt to dynamic application arrival/completion, an adaptive mapping approach coupled with DVFS is presented. This is achieved through an accurate performance model, an energy-efficient resource selection technique, and a resource manager. Experimental evaluation on the Odroid-XU3 shows an improvement of up to 28% in energy efficiency and 7.9% better prediction accuracy from the performance models.
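The proactive DVFS scheme this thesis describes, setting the frequency from predicted Memory Reads Per Instruction (MRPI), can be sketched as a small governor. The predictor (simple exponential smoothing), the DVFS levels, and the MRPI thresholds below are all illustrative assumptions, not the thesis's tuned values.

```python
# Sketch: proactive DVFS using Memory Reads Per Instruction (MRPI) as the
# workload signal. High MRPI indicates memory-bound execution, where lowering
# the clock saves energy with little slowdown. The predictor, frequency
# levels, and thresholds are illustrative assumptions.

FREQ_LEVELS_MHZ = [600, 1000, 1400, 1800]  # hypothetical DVFS levels

def predict_mrpi(history, alpha=0.5):
    """Exponentially smoothed prediction of the next interval's MRPI."""
    pred = history[0]
    for sample in history[1:]:
        pred = alpha * sample + (1 - alpha) * pred
    return pred

def select_freq(predicted_mrpi):
    """Map predicted MRPI to a DVFS level: more memory-bound -> lower clock."""
    if predicted_mrpi > 0.10:   # heavily memory-bound
        return FREQ_LEVELS_MHZ[0]
    if predicted_mrpi > 0.05:
        return FREQ_LEVELS_MHZ[1]
    if predicted_mrpi > 0.02:
        return FREQ_LEVELS_MHZ[2]
    return FREQ_LEVELS_MHZ[3]   # compute-bound: run at top speed
```

A compute-bound phase (low MRPI) keeps the top frequency, while a memory-bound phase drops to a lower level, which is the energy-saving behavior the thesis exploits.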
... The University of California, Davis developed AsAP (Asynchronous Array of Simple Processors) [42]. The objective of the project is to perform signal-processing computations on small processors within limited power budgets. ...
Thesis
Software-defined radio (SDR) is a promising technology for tackling the flexibility requirements of new generations of communication standards. It can easily be reprogrammed at the software level to implement different waveforms. When relying on a software-based technology such as microprocessors, this approach is clearly flexible and quite easy to design, but it usually provides low computing capability and therefore low throughput. To tackle this issue, FPGA technology turns out to be a good alternative for implementing SDRs: FPGAs have both high computing power and reconfiguration capacity. Thus, including FPGAs in the SDR concept may make it possible to support more waveforms, with stricter requirements, than a processor-based approach. However, the main drawbacks of FPGA design are the low level of the input description language, which is essentially the hardware level, and the reconfiguration time, which may exceed run-time requirements if the complete FPGA is reconfigured. To overcome these issues, this PhD thesis proposes a design methodology that leverages both high-level synthesis tools and dynamic reconfiguration. The proposed methodology is a guideline for completely building a flexible radio for FPGA-based SDR that can be reconfigured at run-time.
... Continuous development of modern systems-on-chip (SoC) has led to the emergence of multiprocessor systems. For example, Intel has developed two experimental processors with 48 and 80 cores (Howard et al., 2011), and a processor with 167 cores has been developed (Truong et al., 2009); the ZMS-40 100-core StemCell Media processor with quad ARM Cortex-A9 cores (Huangfu & Zhang, 2015; "ZiiLABS unveils 100-Core ZMS-40 processor", 2012), the TILE-Gx72 processor with 72 C-programmable 64-bit RISC cores, and the networking-targeted TILE-Mx100 processor with 100 64-bit ARM Cortex-A53 cores ("Mellanox Products: TILE-Gx72 Processor", 2016) are commercially available. Other companies are also pursuing ongoing developments. ...
Article
Full-text available
This article considers current trends in network-on-chip (NoC) research and known approaches to NoC modeling. The characteristics of analytic and high-/low-level simulation are given. The SystemC programming language is proposed as an alternative solution for creating NoC models, and methods for increasing the speed of SystemC models are examined and formulated. It is shown how SystemC can reduce the disadvantages and maximize the advantages of the high-level and low-level approaches. To this end, results of high-level, low-level, and SystemC NoC simulation are compared using the example of “hot spots” and the effect of the geometric shape of regular NoC topologies on their performance.
... Finally, voltage scalability is important for supporting dynamic voltage frequency scaling (DVFS) systems [24,25]. DVFS systems can provide peak performance when workload is heavy by operating a processor at nominal supply voltage (V DD ). ...
Article
Full-text available
This paper presents an on-chip temperature sensor circuit for dynamic thermal management in VLSI systems. The sensor directly senses the threshold voltage, which contains temperature information, using a single PMOS device. This simple structure enables the sensor to achieve an ultra-compact footprint. The sensor also exhibits high accuracy and voltage scalability down to 0.4 V, allowing it to be used in dynamic voltage and frequency scaling systems without requiring extra power distribution or regulation. The compact footprint and voltage scalability enable the proposed sensor to be implemented in a digital standard-cell format, allowing aggressive sensor placement very close to target hotspots in digital blocks. The proposed sensor frontend, prototyped in a 65 nm CMOS technology, has a footprint of 30.1 µm² and a 3σ error of ±1.1 °C across 0 to 100 °C after one-point temperature calibration, marking a significant improvement over existing sensors designed for dynamic thermal management in VLSI systems.
... Mesh is one of the most extensively investigated topologies for the communication architecture of multiprocessor systems-on-chip (MPSoC). In particular, the mesh-connected VLSI processor array, exemplified by RAW [1], AsAP [2], TILE64 [3], and AsAP2 [4], is a type of massively parallel system well suited to high-speed implementation of most signal and image processing algorithms. However, as VLSI array density increases, the probability of a malfunction occurring in the array also increases. ...
... Given an m × n host array, the abstraction proceeds from row R1 to row Rm. First, R1 is confirmed as an abstract row. (Fig. 4 of the cited work shows an example of a communication collision caused by R1.) ...
Article
Employment of fault-tolerant techniques is necessary for the operability and reliability of very-large-scale integration (VLSI) arrays. Moreover, fault tolerance must be achieved at high speed in real-time embedded systems. In this paper, a novel abstract model based on an abstraction technique is proposed to accelerate the reconfiguration of degradable VLSI arrays with low fault density. Given an m × n physical array with faults, the proposed technique extracts the physical rows containing faulty elements to construct an abstract array, ignoring the physical rows without faults, so that the size of the original physical array is significantly reduced and the reconfiguration of the VLSI array is sped up. In addition, we demonstrate the preservation of properties between the abstract array and the physical array, guaranteeing that the proposed technique is correct. Experimental results show that for a 128 × 128 physical array with 0.1% faults, a 15 × 128 abstract array can be constructed, and the compression is more than 88.20%. This abstraction technique is bound into a reconfiguration algorithm cited in this paper. Simulation results show that the running time can be improved by more than 66.79% for a 128 × 128 physical array with 0.1% faults. © 2017 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
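The row-abstraction idea in this abstract, keep only the physical rows that contain a fault and reconfigure the much smaller abstract array, can be sketched in a few lines. This is an illustrative re-creation of the technique as described, not the authors' algorithm.

```python
# Sketch of row abstraction: keep only the physical rows that contain at
# least one fault, and measure the resulting compression. Illustrative
# re-creation of the technique described in the abstract.

def abstract_rows(m, n, faults):
    """Rows (0-based) of an m x n array containing at least one fault.
    `faults` is an iterable of (row, col) positions."""
    return sorted({r for r, c in faults if 0 <= r < m and 0 <= c < n})

def compression(m, kept_rows):
    """Fraction of physical rows dropped by the abstraction."""
    return (m - len(kept_rows)) / m

# A 128 x 128 array with 0.1% faults has roughly 16 faulty cells; if they
# fall in 15 distinct rows, the abstract array is 15 x 128 and the
# compression matches the ~88.2% figure reported above.
```

For example, 16 faults spread over 15 rows of a 128 × 128 array give a 15 × 128 abstract array and a compression of 113/128 ≈ 88.3%, consistent with the paper's reported figure.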