Figure 1 - uploaded by Peter Marwedel
Memory configuration flow diagram

Source publication
Technical Report
Full-text available
In this report we evaluate the options for low power on-chip memories during system design and configuration. Specifically, we compare the use of scratch pad memories with that of cache on the basis of performance, area and energy. The target architecture used in our experiments is the AT91M40400 microcontroller containing an ARM7TDMI core. A packi...

Contexts in source publication

Context 1
... have varied the cache and the scratch pad size from 64 bytes to 8192 bytes. For the benchmark suite we chose, beyond a knee point around 1024 bytes for both the cache and the scratch pad, increasing their size yields no further performance improvement. This reflects the overall memory requirements of these benchmarks. Fig. 11 shows the performance variation for the bubble sort, lattice and selection sort benchmarks for the two on-chip memory options in the same ...
Context 2
... area is represented in terms of the number of transistors. These figures are obtained from the cache and scratch pad organization using [9]. Fig. 12 shows area vs. performance for the cache and the scratch pad for biquad_N_sections; Table 3 gives the area/performance tradeoff. Column 1 is the size of the scratch pad and cache in bytes. Columns 2 and 3 are the cache and scratch pad area in transistors. Columns 4 and 5 are the number of CPU cycles for cache- and scratch-pad based ...
Context 3
... we describe the effect of varying the address width on the energy for scratch pad and cache. Next, we give an example of the energy consumption required for a main memory access. Finally, we describe the total energy consumption for the various benchmarks used in the experimental setup. Fig. 15 shows the graph of cache size and scratch pad size in bytes vs. the per-access energy estimates obtained using CACTI. The x-axis is the size in bytes and the y-axis represents energy in ...

Similar publications

Conference Paper
Full-text available
Wireless sensor networks are an active research topic. The sensor nodes (i.e. motes) are the main building blocks of these networks. There is a constant drive to build ever more efficient motes in order to satisfy the demanding specifications of a sensor network. This paper introduces a new mote device, aceMOTE, which is based on a 32 bit...
Preprint
Full-text available
Large Deep Neural Networks (DNNs) are the backbone of today's artificial intelligence due to their ability to make accurate predictions when being trained on huge datasets. With advancing technologies, such as the Internet of Things, interpreting large quantities of data generated by sensors is becoming an increasingly important task. However, in m...
Chapter
Full-text available
This article presents an extensive study which was performed on Infineon’s Aurix2G microcontroller to evaluate the impact of overclocking and underclocking on several parameters such as throughput, current, and power. Guided by the study, both overclocking and underclocking were utilized together to develop a dynamic frequency control approach to a...
Conference Paper
Full-text available
This paper presents initial experimental results toward realizing a multi-core CPU for wireless sensor nodes. The multi-core CPU reduces power consumption while enabling users to easily manage hard real-time tasks. The results show that a sensor node with three CPUs can eliminate about 76% of the power consumption of a single-CPU sensor node.
Article
Full-text available
The article proposes the development of a low-cost, well-performing open-source residential energy demand meter, comparable to commercial meters, in order to achieve accurate results and remain within the limits set by technical standards, with the objective of measuring real and instantaneous energy consumption through microcontroller and...

Citations

... There is a simple relationship between the two: (number of swap-ins In) − (number of swap-outs Out) = SPM size (KB). As the SPM size increases, the number of swap-ins and swap-outs caused by its operation drops significantly, which is consistent with the principle of reducing miss losses by increasing the cache size (Banakar et al., 2001). The C/H value shows that the random sampling algorithm can effectively identify the most frequently accessed data set in the program, namely the core working set. ...
Article
Traditional compiler-based SPM management often fails to accurately predict the memory access characteristics of system scheduling and task switching in a multi-core processor environment, which limits the effectiveness of SPM management. Runtime dynamic detection can compensate for this flaw and provide an accurate and efficient dynamic management method. This research analyses the similarities and differences between SPM management in multi-core and single-task environments, and builds a real-time operating system (RTOS) supporting multi-task scheduling to meet the experimental requirements. On this basis, the random sampling SPM allocation algorithm is adapted and improved to support adaptive SPM allocation at program runtime in a multi-core processor environment. The behaviour of the random sampling algorithm in a multi-core processor environment is analysed, demonstrating the effectiveness of the allocation algorithm for that environment.
... In the last few decades, researchers have been exploring the possibility of adopting scratchpad memories (SPMs) as an alternative to CPU caches [1][2][3]: these are on-chip memories that are directly accessible by the processor (through specific load/store instructions) and do not need to be managed by a cache controller. Being intrinsically simpler, they have better latency and energy characteristics than caches of the same size [4,5]. This is also shown later in this article, with the data presented in Table 3, extracted with a widely employed memory modeling tool [6]. ...
... A technical report from Banakar et al. [5] presented a size-varying comparative analysis of caches and scratchpads, using the SPM to store frequently used portions of both code and data. They generated an energy and area model for both devices and extracted performance data through simulation, showing that a scratchpad wins in every aspect if the memory size is big enough. ...
... These unique weights (since there are very few of them) can be stored close to compute on memory units, processing element scratchpads or SRAM cache depending on the hardware flavour [19,20], all of which have minimal (almost free) access costs. The storage and data movement cost of the encoded weights plus the close-to-compute weight access should be smaller than the storage and movement costs of directly representing the network with the weight values. ...
Preprint
Full-text available
Modern iterations of deep learning models contain millions (billions) of unique parameters, each represented by a b-bit number. Popular attempts at compressing neural networks (such as pruning and quantisation) have shown that many of the parameters are superfluous, which we can remove (pruning) or express with less than b-bits (quantisation) without hindering performance. Here we look to go much further in minimising the information content of networks. Rather than a channel or layer-wise encoding, we look to lossless whole-network quantisation to minimise the entropy and number of unique parameters in a network. We propose a new method, which we call Weight Fixing Networks (WFN) that we design to realise four model outcome objectives: i) very few unique weights, ii) low-entropy weight encodings, iii) unique weight values which are amenable to energy-saving versions of hardware multiplication, and iv) lossless task-performance. Some of these goals are conflicting. To best balance these conflicts, we combine a few novel (and some well-trodden) tricks; a novel regularisation term, (i, ii) a view of clustering cost as relative distance change (i, ii, iv), and a focus on whole-network re-use of weights (i, iii). Our Imagenet experiments demonstrate lossless compression using 56x fewer unique weights and a 1.9x lower weight-space entropy than SOTA quantisation approaches.
... Consequently, evaluating the performance compared to a cache-based system is important. Banakar et al. [43] carried out a comprehensive evaluation of SPM and cache memories. However, their work focused on a single-level shared SPM and cache memory, which is not representative of the architectures used in modern high-performance microprocessors. ...
... We adopt the performance model described in [43] and formalize the timing relation between the SPMs and the caches in the following equations. In Eq. (1), T s1 and T c1 are the latencies of the first-level SPM and cache, respectively. ...
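The excerpt names the symbols of Eq. (1) but does not reproduce the equation itself. As a hedged illustration (our sketch, not the paper's formula), an average-latency comparison using the same symbols might read:

```latex
% T_{s1}, T_{c1}: first-level SPM and cache latencies (as in the excerpt);
% h: cache hit rate; T_{m}: penalty for fetching from the next level.
% An access mapped to the SPM always hits:
T_{\text{spm}} = T_{s1}
% whereas a cached access hits only with probability h:
T_{\text{cache}} = h\,T_{c1} + (1-h)\,(T_{c1} + T_{m})
```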
... The memory objects assignment algorithms used to offer the full-time predictability should avoid both intracore and inter-core conflicts. In the related work section, we mentioned that the ILP based method can avoid intracore conflicts for either the static based method [43] or the dynamic based method [15,16] (called the baseline methods). Consequently, it is attractive to extend the static and dynamic based ILP methods to support the proposed SPM based multicore architectures to maintain time predictability while maximizing performance/energy. ...
Article
Time predictability is crucial in hard real-time and safety-critical systems. Cache memories, while useful for improving the average-case memory performance, are not time predictable, especially when they are shared in multicore processors. To achieve time predictability while minimizing the impact on performance, this paper explores several time-predictable scratch-pad memory (SPM) based architectures for multicore processors. To support these architectures, we propose the dynamic memory objects allocation based partition, the static allocation based partition, and the static allocation based priority L2 SPM strategy to retain the characteristic of time predictability while attempting to maximize the performance and energy efficiency. The SPM based multicore architectural design and the related allocation methods thus form a comprehensive solution to hard real-time multicore based computing. Our experimental results indicate the strengths and weaknesses of each proposed architecture and the allocation method, which offers interesting on-chip memory design options to enable multicore platforms for hard real-time systems.
... SPM is similar to an L1 cache, but it requires explicit instructions to move data to and from main memory, often using DMA (direct memory access) based transfers. A comparative study [69,10] shows that using a scratchpad memory instead of a cache gives an 18% performance improvement for bubble sort and a 34% reduction in chip area, and that it uses less energy per access because of the absence of tag comparisons. From here onwards in this paper we use the abbreviation SMC to denote these memories. ...
... There are also research efforts [32,17,16,18,23] where SMCs have been developed and tested for use in a GPP. The main advantages of using SMCs, as noted in [69,10], are the savings they provide in area and energy. They can also speed up a program because of their close proximity to the CPU. ...
Article
Full-text available
Processors are unable to achieve significant gains in speed using conventional methods. For example, increasing the clock rate increases the average access time to on-chip caches, which in turn lowers the processor's average number of instructions per cycle. The on-chip memory system will be the major bottleneck in future processors. Software-managed on-chip memories (SMCs) are on-chip caches where software can explicitly read and write some or all of the memory references within a block of caches. This paper analyzes the current trends in optimizing the use of these SMCs. We separate and compare these trends based on general classifications developed during our study. The paper not only serves as a collection of recent references, information and classifications for easy comparison and analysis, but also as motivation for improving the SMC management framework for embedded systems. It also makes a first step towards making SMCs useful for general-purpose multicore processors.
... Banakar et al. [6] presented a comprehensive evaluation of scratchpad and cache memories in their research. However, their work focused on a single-level shared scratchpad and cache memory, which is not representative of the architectures used in modern high-performance microprocessors. ...
Article
In modern computer architectures, caches are widely used to shorten the gap between processor speed and memory access time. However, caches are time-unpredictable, and thus can significantly increase the complexity of worst-case execution time (WCET) analysis, which is crucial for real-time systems. This paper proposes a time-predictable two-level scratchpad-based architecture and an ILP-based static memory objects assignment algorithm to support real-time computing. Moreover, to exploit the load/store latencies that are known statically in this architecture, we study a Scratch-pad Sensitive Scheduling method to further improve the performance. Our experimental results indicate that the performance and energy consumption of the two-level scratchpad-based architecture are superior to the similar cache based architecture for most of the benchmarks we studied.
... When focusing on embedded systems and applications, one of the biggest advantages of caches, versatility, is often unneeded, while power consumption and die area (where caches are at a disadvantage due to the overhead of tag area and control logic) play much more important roles. The authors of a detailed study [3] trading off caches against SPMs found in their experiments that the latter exhibit 34% smaller area and 40% lower power consumption than a cache of the same capacity. Even more surprisingly, the runtime measured in cycles was 18% better with an SPM using a simple static knapsack-based allocation algorithm. ...
Conference Paper
Full-text available
This paper presents an energy-aware electronic design automation (EDA) methodology for the system-level exploration of hierarchical storage organizations, focusing mainly on data-intensive signal processing applications. Starting from the high-level behavioral specification of a given application, several memory management tasks are addressed in a common algebraic framework, using data-dependence analysis techniques similar to those used in modern compilers. Within this memory management software system, the designer can explore different algorithmic specifications functionally equivalent, by computing their minimum storage requirements. The system can perform an exploration based on energy consumption of signal assignments to the off- and on-chip memory layers, followed by a storage-efficient mapping of signals to the physical memories. The last phase of the methodology is an exploration approach for energy-aware banking of the on-chip memory, which takes into account both the static and dynamic energy consumption.
... The consequence is that there is no need to check for the availability of the data in the SPM. Hence, the SPM does not possess a comparator and the miss/hit acknowledging circuitry [3]. This contributes to a significant energy (as well as area) reduction. ...
... The first bar is the reference and corresponds to a "flat" memory design, in which all operands have to be retrieved from the large background memory. [Fig. 3: Variation of the dynamic energy consumption with the SPM size for the illustrative example in Fig. 2(a).] The CACTI model can be used to estimate the energy consumption for SPMs, as explained and exemplified in [3] (Appendix A). Storing all the signals on-chip is, obviously, the most desirable scenario from the point of view of dynamic energy consumption. ...
Article
Full-text available
SUMMARY In real-time data-dominated communication and multimedia processing applications, a multi-layer memory hierarchy is typically used to enhance the system performance and also to reduce the energy consumption. Savings of dynamic energy can be obtained by accessing frequently used data from smaller on-chip memories rather than from large background memories. This paper focuses on the reduction of the dynamic energy consumption in the memory subsystem of multidimensional signal processing systems, starting from the high-level algorithmic specification of the application. The paper presents a formal model which identifies those parts of arrays more intensely accessed, taking also into account the relative lifetimes of the signals. Tested on a two-layer memory hierarchy, this model led to savings of dynamic energy from 40% to over 70% relative to the energy used in the case of flat memory designs.
... A further line of investigation would therefore be a comparison of the energy consumption of the two on-chip memories at equal chip area. The investigations showed that the use of on-chip memory leads to a significant reduction in the energy consumption of processor systems. Similar investigations were carried out by Banakar et al. [BS01]. In those investigations, besides energy consumption and performance, the chip-area requirements of the two on-chip memories were also considered. ...