Figure 2 - uploaded by Abu Asaduzzaman
Simulated architecture with a two-level cache memory hierarchy.


Source publication
Conference Paper
Full-text available
Signal processing has been implemented in many computing devices that are successfully used in mission-critical NASA programs, military operations, and medical devices. The popularity of and demand for signal processing systems are increasing in many other domains, including commercial products. Many applications in signal processing systems need...

Context in source publication

Context 1
... the simulated architecture includes a single processing core and a memory hierarchy with two levels of caches, as shown in Figure 2. The L1 cache is split into instruction (I1) and data (D1) caches for improved performance; the L2 cache is unified. The processing core encodes video streams, which are stored in the main memory. The core reads the raw data from (and writes the encoded video into) the main memory through its cache hierarchy. We keep the architecture setup fixed and run the MPEG-4 and H.264/AVC encoders one at a time.

We develop a simulation program using VisualSim (short for VisualSim Architecture) from Mirabilis Design, Inc. to evaluate performance and energy. VisualSim is a graphical and hierarchical modeling tool, effective for simulating the system-level architecture of signal processing systems and real-time applications [24]. We evaluate the performance (in terms of mean delay) and the total energy consumption for various L2 cache sizes, line sizes, and associativity levels. In this work, we keep each of I1 and D1 fixed at 16KB, the L1 line size fixed at 64B, and the L1 associativity fixed at 4-way, and we vary the L2 cache size, line size, and associativity level as shown in Table 1. The output parameters are the mean delay and the total energy consumption of the system. Delay is defined as the number of processor cycles between the start and the end of a task's execution.

In this work, we select two popular and representative real-time applications, MPEG-4 and H.264/AVC, to run the simulation program. MPEG-4 is the global video coding standard for multimedia applications, and H.264/AVC is the network-friendly video coding standard for conversational and non-conversational applications [20-23]. We characterize both applications using the Cachegrind package [25] (video files are encoded with the appropriate CODEC – FFmpeg for MPEG-4 and JM-RS (96) for H.264/AVC) and create workloads that capture all possible scenarios the target architecture will experience.
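As a rough illustration of how such a two-level hierarchy turns miss rates into a mean delay, the sketch below computes an average memory-access delay in cycles from I1/D1/L2 miss rates of the kind Cachegrind reports. The latencies, miss rates, and access mix are hypothetical placeholders, not the paper's VisualSim model or measured values.

```python
# Hypothetical sketch (not the authors' VisualSim model): mean memory-access
# delay for a core with split L1 (I1/D1) and a unified L2, driven by the
# kind of miss rates a tool such as Cachegrind reports.
def mean_delay(accesses, i1_miss, d1_miss, l2_miss,
               l1_hit=1, l2_hit=10, mem=100):
    """Latencies are in processor cycles; miss rates are fractions in [0, 1].
    accesses = (instruction_fetches, data_accesses)."""
    ifetch, daccess = accesses
    # L1 misses from both I1 and D1 go to the unified L2.
    l1_misses = ifetch * i1_miss + daccess * d1_miss
    l2_misses = l1_misses * l2_miss
    total_cycles = ((ifetch + daccess) * l1_hit   # every access probes L1
                    + l1_misses * l2_hit          # L1 misses probe L2
                    + l2_misses * mem)            # L2 misses go to main memory
    return total_cycles / (ifetch + daccess)

# Example mix: 70% instruction fetches, 30% data accesses per 1M references
# (illustrative numbers only).
d = mean_delay((700_000, 300_000), i1_miss=0.02, d1_miss=0.05, l2_miss=0.20)
```

Lowering the L2 miss rate (e.g., by enlarging the cache) shrinks only the last term, which is why the delay curves flatten once the L2 captures most of the working set.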
MPEG-4 and H.264/AVC workload statistics are shown in Table 2. Using Cachegrind, the I1, D1, and L2 miss rates are obtained for the fixed I1 and D1 parameters and the variable L2 cache size, line size, and associativity level. Using VisualSim, we model the selected architecture and drive the simulation program with the Cachegrind-derived workloads. Simulation results are presented in the following subsections.

First, we present the mean delay for various L2 cache sizes, line sizes, and associativity levels. Simulation results show that for smaller L2 caches (between 128KB and 1MB), the mean delay decreases sharply as the L2 cache size increases (see Figure 3); for L2 caches larger than 1MB, the decrease is less sharp. Results also show that the mean delay of the H.264/AVC encoder is smaller than that of the MPEG-4 encoder at every L2 cache size. The 4MB size (the largest explored) minimizes the mean delay.

Figure 4 shows the mean delay for various L2 line sizes. The delay decreases significantly as the line size grows from 16B to 64B; between 64B and 128B the changes are not significant; beyond 128B the mean delay starts increasing due to cache pollution. The 128B L2 line size minimizes the mean delay.

The mean delay for various L2 associativity levels is shown in Figure 5. The delay decreases very significantly when the associativity is increased from 1-way (direct mapping) to 2-way; between 2-way and 8-way the changes are not significant; beyond 8-way the mean delay remains almost the same. The 8-way L2 associativity level optimizes the mean delay.

In this subsection, we present the total energy consumption for various L2 cache sizes, line sizes, and associativity levels.
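The cache-size sweep described above can be mimicked in a few lines: given per-configuration L2 miss rates measured offline, pick the L2 size that minimizes delay under a simple latency model. The miss-rate table and latencies below are invented to show the diminishing-returns shape, not data from the paper's tables.

```python
# Hypothetical L2 miss rates per cache size in KB. The shape (sharp drop up
# to 1MB, flattening after) mirrors the qualitative trend in the results,
# but the values themselves are illustrative placeholders.
L2_MISS_BY_SIZE_KB = {128: 0.30, 256: 0.18, 512: 0.10,
                      1024: 0.05, 2048: 0.04, 4096: 0.035}

def mean_delay(l2_miss, l1_miss=0.04, l1_hit=1, l2_hit=10, mem=100):
    # Average cycles per access for a two-level hierarchy
    # (latencies in cycles; all numbers assumed, not measured).
    return l1_hit + l1_miss * (l2_hit + l2_miss * mem)

best_kb = min(L2_MISS_BY_SIZE_KB,
              key=lambda kb: mean_delay(L2_MISS_BY_SIZE_KB[kb]))
# With these numbers the largest size (4MB) gives the lowest delay,
# though the gain past 1MB is small.
```

The same sweep pattern applies to line size and associativity: substitute a per-configuration miss-rate table and minimize the same delay function.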
As shown in Figure 6, for smaller L2 caches (between 128KB and 1MB), the total energy consumption decreases significantly as the L2 cache size increases; this is not the case for L2 caches larger than 1MB. Simulation results also show that H.264/AVC encoding consumes less energy than MPEG-4 encoding. The 4MB size (the largest considered) optimizes the total energy consumption.

As shown in Figure 7, the total energy consumption decreases significantly as the line size grows from 16B to 64B; between 64B and 128B the changes are not significant; beyond 128B the total energy consumption starts increasing, because the cache miss ratio rises due to cache pollution. Regardless of the L2 line size, the total energy consumption of the H.264/AVC encoder is smaller than that of the MPEG-4 encoder. The 128B L2 line size also optimizes the total energy consumption.

The total energy consumption for various L2 associativity levels is shown in Figure 8. It decreases very significantly when the associativity is increased from 1-way (direct mapping) to 2-way; between 2-way and 8-way the changes are not significant; beyond 8-way it remains almost the same. The 8-way L2 associativity is also best for energy consumption.

The popularity of and demand for data- and computation-intensive signal processing systems are increasing, and it is beneficial to use high-performance processing cores to carry out the efficient algorithms developed for such systems. On the one hand, the cache improves performance by reducing the speed gap between the CPU and the main memory; on the other hand, the cache is power hungry and consumes a significant amount of energy. The additional energy consumption and dissipation pose challenges to the cooling system (and to battery life, if the system is battery operated).
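A minimal way to see the energy trade-off the results describe is a per-level energy model: dynamic energy scales with accesses and misses at each level, while static (leakage) energy accumulates over the run. All per-access energies and the leakage power below are invented placeholders, not the simulator's parameters.

```python
def total_energy(accesses, l1_miss, l2_miss, cycles,
                 e_l1=0.1, e_l2=0.5, e_mem=5.0, p_leak=0.02):
    """Dynamic + static energy in nJ. e_* are per-access energies and
    p_leak is leakage energy per cycle; all values are hypothetical."""
    l1_misses = accesses * l1_miss     # accesses that reach L2
    l2_misses = l1_misses * l2_miss    # accesses that reach main memory
    dynamic = accesses * e_l1 + l1_misses * e_l2 + l2_misses * e_mem
    static = cycles * p_leak           # a larger L2 would raise p_leak
    return dynamic + static
```

A bigger L2 cuts `l2_misses` (and hence costly main-memory energy) but raises the per-access and leakage terms, which is why the energy curve flattens rather than falling indefinitely with cache size.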
In this paper, we optimize the performance and energy consumption of a signal processing system running real-time multimedia applications by tuning the L2 cache attributes. Using VisualSim, we model the architecture that performs the computation and data storage. The first-level cache is split into instruction and data caches, and the second-level cache is unified. We use the MPEG-4 and H.264/AVC encoder algorithms to run the simulation program. Simulation results show that for the target signal processing architecture, the overall system performance can be enhanced and the total energy consumption reduced by adjusting the cache parameters, because capacity, compulsory, and conflict misses can all be reduced by selecting the right cache parameters. Considering unit cost, die area, and the most favorable performance and energy consumption, the optimum L2 values in this experiment are a 1MB cache size, a 128B line size, and a 4-way associativity level. The performance and energy optimization simulation platform used in this work can be reused to select the right algorithm (for example, the MPEG-4 or H.264/AVC codec) for a target signal processing system. In the future, we plan to investigate the impact of cache locking on the performance and energy consumption of real-time multicore signal processing ...

Similar publications

Article
Full-text available
This paper presents a new model for simulating very large-scale traffic networks. The proposed model is based on microscopic cellular automata (CA), extended to eliminate unwanted properties of ordinary CA-based models, such as stopping from maximum speed to zero in one time step. The accuracy of the model has been validated by comparisons with...

Citations

... This mode does not require complete information on access patterns or the addresses of the data, which are not always known at compile time. According to Asaduzzaman et al. [8], tuning cache memory attributes can optimize the performance and energy consumption of a single core with a two-level cache memory hierarchy. In multi-core systems, a hardware-controlled cache may be one of the main impediments to easing programming [9]. ...
Conference Paper
Full-text available
Stream processors, such as Imagine, GPGPUs, FT64, and MASA, typically use a software-managed scratchpad instruction memory, which improves performance and significantly reduces energy consumption. In this paper, we build a kernel-storage model to analyze the hot spots of kernels in stream programs. Based on the analysis, we define Kernel Hot Code and prove that scratchpad instruction memory should focus on its access efficiency. A methodology for finding Kernel Hot Code in kernels of different structures is presented as well. In accordance with this method, we develop HOIS for Stream Architecture, which adopts a software-managed scratchpad memory to store Kernel Hot Code and uses a small hardware-managed victim cache to store Kernel Cool Code. HOIS is evaluated by measuring the performance of six applications on the MASA_S simulation platform. The results show that HOIS achieves high efficiency in predictable applications with little performance loss.
Conference Paper
The imitated gray scale space extremum-median filter (IGSSEM) algorithm was proposed to address the difficulties the improved extremum and median filter (IEM) algorithm faces when processing color images. The algorithm keeps IEM's characteristic of reducing the chance of mistaking signal pixels for noise pixels, while using a more appropriate method to detect noise pixels. Moreover, when noise pixels exceed half of the total pixels, the standard median filtering algorithm cannot cope; by using an imitated gray-scale transform to locate noise points, the proposed algorithm avoids the need for prior judgments about the noise. Experimental results show that when the noise density is high, the proposed algorithm not only preserves more color-image detail than the IEM filter algorithm but is also more convenient for color image processing.