Figure 1 - uploaded by Mats Brorsson
Memory map of the simulated memory.


Source publication
Conference Paper
Full-text available
The approach of program-driven simulation of multiprocessors has generally been believed to be too slow in order to perform experiments and performance evaluations with realistic workloads. We show that the program-driven approach for building multiprocessor simulators is indeed a viable method. It compares well in performance to an execution-drive...

Context in source publication

Context 1
... page size can be changed when recompiling the simulator. Figure 1 shows the map of the memory model. The first two pages of memory are reserved for passing information from the simulator to the run-time environment of the application program. ...

Similar publications

Conference Paper
Full-text available
This paper presents the Simulation and Wavefront parallel-programming patterns of the MAP3S pattern-based parallel programming system for distributed-memory environments. Both patterns target iterative computations on static and regular meshes. In addition to providing performance-oriented features, such as asynchronous communication and distributi...

Citations

... PROTEUS has also been used in the development of the MIT ALEWIFE machine [7]. CacheMire [8] from Lund University evolved from a simulation model for the Data Diffusion Machine [9] and was later generalized for the study of shared-memory architectures. The Wisconsin Wind Tunnel (WWT) [10] from the University of Wisconsin-Madison was originally built to perform cache coherence protocol studies. ...
Article
As the complexity of parallel machine systems increases, simulation plays an ever more important role in the study of most aspects of such systems; it has become an indispensable performance-evaluation tool. Given advances in theory and technology, and the great speed and mass storage that the latest computer systems provide, it has become urgent to develop a new performance-evaluation platform. Against this background, we have designed and are implementing a simulation platform named SPMD-HLAPSE (SPMD High-Level Architecture Performance Simulation Environment) for evaluating the high-level performance of parallel machine architectures. This simulation environment is designed for SPMD parallel machines. Most parts of it have been implemented, and we have used it to evaluate a SIMD parallel machine.
... In Section 5, we simulate the baseline hardwareonly implementation of Section 2.2 and the different software-only directory implementations described in Sections 2.3 and 3. The simulation models are built on top of the CacheMire Test Bench [4], a program-driven simulator and programming environment. The simulator consists of two parts: a functional simulator of multiple SPARC processors and a memory system simulator . ...
Article
The hardware complexity of hardware-only directory protocols in shared-memory multiprocessors has motivated many researchers to emulate directory management by software handlers executed on the compute processors, called software-only directory protocols. In this paper, we evaluate the performance and design trade-offs between these two approaches in the same architectural simulation framework driven by eight applications from the SPLASH-2 suite. Our evaluation reveals some common-case operations that can be supported by simple hardware mechanisms and can make the performance of software-only directory protocols competitive with that of hardware-only protocols. These mechanisms aim at either reducing the software handler latency or hiding it by overlapping it with the message latencies associated with inter-node memory transactions. Further, we evaluate the effects of cache block sizes between 16 and 256 bytes as well as two different page placement policies. Overall, we find that a software-only directory protocol enhanced with these mechanisms can reach between 63% and 97% of the baseline hardware-only protocol performance at a lower design complexity.
... It is passed a memory transaction which already contains all relevant information, including logical, physical, and real addresses, and information from the MMU including the cache valid bit. It is up to the memory hierarchy module to update cache state and keep track of relevant statistics. In order to reduce the number of unnecessary calls to this routine, the memory hierarchy module can filter out cache line accesses by calling mem_add_to_STC(). ...
Article
Full-text available
We describe novel techniques used for efficient simulation of memory in SimICS, an instruction level simulator developed at SICS. The design has focused on efficiently supporting the simulation of multiprocessors, analyzing complex memory hierarchies and running large binaries with a mixture of system-level and user-level code.
... This technique allows the simulation of a parallel system on a sequential machine by interleaving the execution of the different simulated processors in an optimal way [3] [11]. This widely used approach is very flexible and can be made quite realistic by carefully choosing the execution parameters. ...
Article
A new technique for the efficient asynchronous discrete event-driven simulation of parallel shared-memory computers is proposed. Our execution-driven methodology, while introducing minimal synchronization overhead to maintain a coherent distributed event causality relation, allows complete virtualization of the design at all levels and is therefore very flexible. We give a detailed description of the proposed technique together with some preliminary performance results obtained by implementing our parallelization technique on a CM-5 computer.
... The simulation models are built on top of the CacheMire Test Bench [2], a simulation framework and programming environment. The framework consists of multiple SPARC processors simulated at the instruction level and an architectural simulator of the multiprocessor model. ...
Article
We study in this paper how effective latency-tolerating and -reducing techniques are at cutting the memory access times for shared-memory multiprocessors with directory cache protocols managed by hardware and software. A critical issue for the relative efficiency is how many protocol operations such techniques trigger. This paper presents a framework that makes it possible to reason about the expected relative efficiency of a latency-tolerating or -reducing technique by focusing on whether the technique increases, decreases, or does not change the number of protocol operations at the memory module. Since software-only directory protocols handle these operations in software they will perform relatively worse unless the technique reduces the number of protocol operations. Our experimental results from detailed architectural simulations driven by six applications from the SPLASH-2 parallel program suite confirm this expectation. We find that while prefetching performs relatively worse on software-only directory protocols due to useless prefetches, there are examples of protocol optimizations, e.g., optimizations for migratory data, that do relatively better on software-only directory protocols. Overall, this study shows that latency-tolerating techniques must be more carefully selected for software-centric than for hardware-centric implementations of distributed shared-memory systems.
... The simulation models are built on top of the CacheMire Test Bench [2], a simulation framework and programming environment. The framework consists of multiple SPARC processors simulated at the instruction level and an architectural simulator of the multiprocessor model. ...
... To study the performance effects of this microbenchmark, we have simulated it with the CacheMire test bench [5], a program-driven functional simulator. We simulate a system of four uniprocessor nodes interconnected by an ATM switch. ...
Article
Synchronization overhead may limit the number of applications that can take advantage of a shared-memory abstraction on top of emerging network-of-workstations organizations. While the programmer could spend additional effort on getting rid of such overhead by restructuring the computation, this paper focuses on a simpler approach where the overhead of lock operations is hidden through lock prefetch annotations. Our approach aims at hiding the lock acquisition latency by prefetching the lock ahead of time. This paper presents a compiler approach which turned out to insert lock prefetching annotations automatically and successfully in five out of eight applications. In the others, we show that hand insertion could be done fairly easily without any prior knowledge of the applications. We also study the performance improvements of this approach in detail by considering network-of-workstations organizations built from uniprocessor as well as symmetric multiprocessor nodes for emerging interconnect technologies such as ATM. It is shown that the significant latencies have a dramatic effect on the lock acquisition overhead, and that this overhead can be drastically reduced by lock prefetching. Overall, lock prefetching is a simple and effective approach to allow more fine-grained applications to run well on emerging network-of-workstations platforms.
... A trace-driven simulator, called superscalar, implements a superscalar processor with support for assisted execution. It is driven by an execution-driven Sparc processor simulator, called CacheMire-2 [1], which generates decoded instruction streams for main thread and nanothreads to superscalar. Superscalar sends requests for instructions to CacheMire-2 with instruction count and thread identifier in the instruction fetch stage. ...
Article
We introduce a new execution paradigm called assisted execution. In this model, a set of auxiliary "assistant" threads, called nanothreads, is attached to each thread of an application. Nanothreads are very lightweight threads which run on the same processor as the main (application) thread and help execute the main thread as fast as possible. Nanothreads exploit resources that are idled in the processor because of dependencies and memory access delays. Assisted execution has the potential to alter the current trade-offs between static and dynamic execution mechanisms. Nanothreads can monitor and reconfigure the underlying hardware, can emulate hardware and can profile applications with little or no interference to improve the program on-line or off-line. We demonstrate the power of assisted execution with an important application, namely data prefetching to fight the memory wall problem. Simulation results on several SPEC95 benchmarks show that sequential and stride prefet...
... The first set is a group of three programs from the Stanford SPLASH benchmark suite [29]: MP3D, WATER and CHOLESKY. These programs were traced using the CacheMire simulator [4] developed by Per Stenstrom's group at Lund University, Sweden, and traces were obtained for systems with 8, 16 and 32 processors. MP3D is a rarefied fluid flow simulation program used to study the forces applied to objects flying in the upper atmosphere at hypersonic speeds, and it is based on Monte Carlo methods. ...
... Figures 7-12 show performance results for MP3D, which (as can be seen from Table 4) has a relatively high miss ratio for shared data and also a significant fraction of shared-data accesses. In the 8-processor MP3D, the performance of the 50 MHz bus is comparable to the 200 MHz ring for slower processors (≤ 25 MIPS), and the processor utilization curves for SIMPLE (snooping and directory) and FFT-directory superimpose. ...
Article
Advances in circuit and integration technology are continuously boosting the speed of microprocessors. One of the main challenges presented by such developments is the effective use of powerful microprocessors in shared memory multiprocessor configurations. We believe that the interconnection problem is not solved even for small scale shared memory multiprocessors, since shared buses are unlikely to keep up with the memory bandwidth requirements of new microprocessors. In this paper we extensively evaluate the performance of the slotted ring interconnection as a replacement for buses in small to medium scale shared memory systems and for processor clusters in hierarchical massively parallel systems, using a hybrid methodology of analytical models and trace-driven simulations. Snooping and directory-based coherence protocols for the ring are compared in the context of multitasking. 1.0 Introduction and Motivations In the last decade, parallel processing has emerged as the consensus ...
... This code reduces the frequency of context switches. By contrast, program-driven simulators such as Cache-Mire [1] interpret each instruction in software. Cache-Mire also relies on activity scanning (rather than an event list) for scheduling activities from processors and memory. ...
... Trying to compare the efficiency of the testbed with existing simulators is a hazardous task at best. However, we have used two state-of-the-art software simulators, Cache-Mire [1] and [5]; there are write buffers with multiple outstanding requests in both the first-level and the second-level caches [3]. The simulation rate of Cache-Mire is the number of cycles of the target system simulated per second. ...
Article
Full-text available
In multiprocessor systems, processing nodes contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect. The system memory partitions may be shared (shared-memory systems) or disjoint (message-passing systems). Within each class of systems many architectural variations are possible. Comparisons among systems are difficult because of the lack of a common hardware platform to implement the different architectures. The U.S.C. Multiprocessor Testbed is a hardware emulator for the rapid prototyping of vastly different multiprocessor systems. In the testbed the hardware of the target machine is emulated by reprogrammable controllers implemented with Field-Programmable Gate Arrays (FPGAs). The processors, memories and interconnect are off-the-shelf and their relative speed can be modified to emulate various component technologies. Every emulation is an actual incarnation of the target machine and therefore all software written for the tar...