Figure 1 - uploaded by Mats Brorsson
Memory map of the simulated memory.


Source publication
Conference Paper
Full-text available
The approach of program-driven simulation of multiprocessors has generally been believed to be too slow in order to perform experiments and performance evaluations with realistic workloads. We show that the program-driven approach for building multiprocessor simulators is indeed a viable method. It compares well in performance to an execution-drive...

Context in source publication

Context 1
... page size can be changed when recompiling the simulator. Figure 1 shows the map of the memory model. The first two pages of memory are reserved for passing information from the simulator to the run-time environment of the application program. ...

Similar publications

Conference Paper
Full-text available
This paper presents the Simulation and Wavefront parallel-programming patterns of the MAP3S pattern-based parallel programming system for distributed-memory environments. Both patterns target iterative computations on static and regular meshes. In addition to providing performance-oriented features, such as asynchronous communication and distributi...

Citations

... PROTEUS has also been used in the development of the MIT ALEWIFE machine [7]. CacheMire [8] from Lund University evolved from a simulation model for the Data Diffusion Machine [9] and was later generalized for the study of shared-memory architectures. The Wisconsin Wind Tunnel (WWT) [10] from the University of Wisconsin-Madison was originally built to perform cache coherence protocol studies. ...
Article
As the complexity of parallel machine systems increases, simulation plays an ever more important role in the study of most aspects of such systems; it has become an indispensable performance-evaluation tool. Given advances in theory and technology, and the great speed and mass storage that the latest computer systems provide, it has become urgent to develop a new performance-evaluation platform. Against this background, we have designed and are implementing a simulation platform named SPMD-HLAPSE (SPMD High-Level Architecture Performance Simulation Environment) for evaluating the high-level performance of parallel machine architectures. This simulation environment is designed for SPMD parallel machines. Most parts of it have been implemented, and we have used it to evaluate a SIMD parallel machine.
... In Section 5, we simulate the baseline hardwareonly implementation of Section 2.2 and the different software-only directory implementations described in Sections 2.3 and 3. The simulation models are built on top of the CacheMire Test Bench [4], a program-driven simulator and programming environment. The simulator consists of two parts: a functional simulator of multiple SPARC processors and a memory system simulator . ...
Article
The hardware complexity of hardware-only directory protocols in shared-memory multiprocessors has motivated many researchers to emulate directory management by software handlers executed on the compute processors, called software-only directory protocols. In this paper, we evaluate the performance and design trade-offs between these two approaches in the same architectural simulation framework driven by eight applications from the SPLASH-2 suite. Our evaluation reveals some common-case operations that can be supported by simple hardware mechanisms and can make the performance of software-only directory protocols competitive with that of hardware-only protocols. These mechanisms aim at either reducing the software handler latency or hiding it by overlapping it with the message latencies associated with inter-node memory transactions. Further, we evaluate the effects of cache block sizes between 16 and 256 bytes as well as two different page placement policies. Overall, we find that a software-only directory protocol enhanced with these mechanisms can reach between 63% and 97% of the baseline hardware-only protocol performance at a lower design complexity.
... It is passed a memory transaction which already contains all relevant information, including logical, physical, and real addresses, and information from the MMU including the cache valid bit. It is up to the memory hierarchy module to update cache state and keep track of relevant statistics. In order to reduce the number of unnecessary calls to this routine, the memory hierarchy module can filter out cache line accesses by calling mem_add_to_STC(). ...
Article
Full-text available
We describe novel techniques used for efficient simulation of memory in SimICS, an instruction level simulator developed at SICS. The design has focused on efficiently supporting the simulation of multiprocessors, analyzing complex memory hierarchies and running large binaries with a mixture of system-level and user-level code.
... This technique allows the simulation of a parallel system on a sequential machine by interleaving the execution of the different simulated processors in an optimal way [3] [11]. This widely used approach is very flexible and can be made quite realistic by carefully choosing the execution parameters. ...
Article
A new technique for the efficient asynchronous discrete event-driven simulation of parallel shared-memory computers is proposed. Our execution-driven methodology, while introducing minimal synchronization overhead to maintain a coherent distributed event causality relation, allows complete virtualization of the design at all levels and is therefore very flexible. We give a detailed description of the proposed technique together with some preliminary performance results obtained by implementing our parallelization technique on a CM-5 computer.
... The simulation models are built on top of the CacheMire Test Bench [2], a simulation framework and programming environment. The framework consists of multiple SPARC processors simulated at the instruction level and an architectural simulator of the multiprocessor model. ...
Article
We study in this paper how effective latency-tolerating and -reducing techniques are at cutting the memory access times for shared-memory multiprocessors with directory cache protocols managed by hardware and software. A critical issue for the relative efficiency is how many protocol operations such techniques trigger. This paper presents a framework that makes it possible to reason about the expected relative efficiency of a latency-tolerating or -reducing technique by focusing on whether the technique increases, decreases, or does not change the number of protocol operations at the memory module. Since software-only directory protocols handle these operations in software they will perform relatively worse unless the technique reduces the number of protocol operations. Our experimental results from detailed architectural simulations driven by six applications from the SPLASH-2 parallel program suite confirm this expectation. We find that while prefetching performs relatively worse on software-only directory protocols due to useless prefetches, there are examples of protocol optimizations, e.g., optimizations for migratory data, that do relatively better on software-only directory protocols. Overall, this study shows that latency-tolerating techniques must be more carefully selected for software-centric than for hardware-centric implementations of distributed shared-memory systems.
... The simulation models are built on top of the CacheMire Test Bench [2], a simulation framework and programming environment. The framework consists of multiple SPARC processors simulated at the instruction level and an architectural simulator of the multiprocessor model. ...
... To study the performance effects of this microbenchmark, we have simulated it with the CacheMire test bench [5], a program-driven functional simulator. We simulate a system of four uniprocessor nodes interconnected by an ATM switch. ...
Article
Synchronization overhead may limit the number of applications that can take advantage of a shared-memory abstraction on top of emerging network-of-workstations organizations. While the programmer could spend additional effort on getting rid of such overhead by restructuring the computation, this paper focuses on a simpler approach where the overhead of lock operations is hidden through lock prefetch annotations. Our approach aims at hiding the lock acquisition latency by prefetching the lock ahead of time. This paper presents a compiler approach which turned out to insert lock prefetching annotations automatically and successfully in five out of eight applications. In the others, we show that hand insertion could be done fairly easily without any prior knowledge of the applications. We also study the performance improvements of this approach in detail by considering network-of-workstations organizations built from uniprocessor as well as symmetric multiprocessor nodes for emerging interconnect technologies such as ATM. It is shown that the significant latencies have a dramatic effect on the lock acquisition overhead, and that this overhead can be drastically reduced by lock prefetching. Overall, lock prefetching is a simple and effective approach to allow more fine-grained applications to run well on emerging network-of-workstations platforms.
... A trace-driven simulator, called superscalar, implements a superscalar processor with support for assisted execution. It is driven by an execution-driven Sparc processor simulator, called CacheMire-2 [1], which generates decoded instruction streams for main thread and nanothreads to superscalar. Superscalar sends requests for instructions to CacheMire-2 with instruction count and thread identifier in the instruction fetch stage. ...
Article
We introduce a new execution paradigm called assisted execution. In this model, a set of auxiliary "assistant" threads, called nanothreads, is attached to each thread of an application. Nanothreads are very lightweight threads which run on the same processor as the main (application) thread and help execute the main thread as fast as possible. Nanothreads exploit resources that are idled in the processor because of dependencies and memory access delays. Assisted execution has the potential to alter the current trade-offs between static and dynamic execution mechanisms. Nanothreads can monitor and reconfigure the underlying hardware, can emulate hardware and can profile applications with little or no interference to improve the program on-line or off-line. We demonstrate the power of assisted execution with an important application, namely data prefetching to fight the memory wall problem. Simulation results on several SPEC95 benchmarks show that sequential and stride prefet...
... The first set is a group of three programs from the Stanford SPLASH benchmark suite [29]: MP3D, WATER and CHOLESKY. These programs were traced using the CacheMire simulator [4] developed by Per Stenstrom's group at Lund University, Sweden, and traces were obtained for systems with 8, 16 and 32 processors. MP3D is a rarefied fluid flow simulation program used to study the forces applied to objects flying in the upper atmosphere at hypersonic speeds, and it is based on Monte Carlo methods. ...
... Figures 7-12 show performance results for MP3D, which (as can be seen from Table 4) has a relatively high miss ratio for shared data and also a significant fraction of shared-data accesses. In the 8-processor MP3D, the performance of the 50 MHz bus is comparable to the 200 MHz ring for slower processors (≤ 25 MIPS), and the processor utilization curves for SIMPLE (snooping and directory) and FFT-directory superimpose. ...
Article
Advances in circuit and integration technology are continuously boosting the speed of microprocessors. One of the main challenges presented by such developments is the effective use of powerful microprocessors in shared memory multiprocessor configurations. We believe that the interconnection problem is not solved even for small scale shared memory multiprocessors, since shared buses are unlikely to keep up with the memory bandwidth requirements of new microprocessors. In this paper we extensively evaluate the performance of the slotted ring interconnection as a replacement for buses in small to medium scale shared memory systems and for processor clusters in hierarchical massively parallel systems, using a hybrid methodology of analytical models and trace-driven simulations. Snooping and directory-based coherence protocols for the ring are compared in the context of multitasking. 1.0 Introduction and Motivations In the last decade, parallel processing has emerged as the consensus ...
... This code reduces the frequency of context switches. By contrast, program-driven simulators such as Cache-Mire [1] interpret each instruction in software. Cache-Mire also relies on activity scanning (rather than an event list) for scheduling activities from processors and memory. ...
... Trying to compare the efficiency of the testbed with existing simulators is a hazardous task at best. However, we have used two state-of-the-art software simulators, Cache-Mire [1] and [5]; there are write buffers with multiple outstanding requests in both the first-level and the second-level caches [3]. The simulation rate of Cache-Mire is the number of cycles of the target system simulated per second. ...
Article
Full-text available
In multiprocessor systems, processing nodes contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect. The system memory partitions may be shared (shared-memory systems) or disjoint (message-passing systems). Within each class of systems many architectural variations are possible. Comparisons among systems are difficult because of the lack of a common hardware platform to implement the different architectures. The U.S.C. Multiprocessor Testbed is a hardware emulator for the rapid prototyping of vastly different multiprocessor systems. In the testbed the hardware of the target machine is emulated by reprogrammable controllers implemented with Field-Programmable Gate Arrays (FPGAs). The processors, memories and interconnect are off-the-shelf and their relative speed can be modified to emulate various component technologies. Every emulation is an actual incarnation of the target machine and therefore all software written for the tar...