Article

Using Prefetching to Hide Lock Acquisition Latency in Distributed Virtual Shared Memory Systems

Abstract

Synchronization overhead may limit the number of applications that can take advantage of a shared-memory abstraction on top of emerging network-of-workstations (NOW) organizations. While the programmer could spend additional effort restructuring the computation to remove such overhead, this paper focuses on a simpler approach in which the cost of lock operations is hidden through lock prefetch annotations: the lock is prefetched ahead of time so that its acquisition latency overlaps with useful computation. We present a compiler approach that automatically inserts lock prefetching annotations, doing so successfully in five out of eight applications; for the remaining three, we show that hand insertion can be done fairly easily without any prior knowledge of the applications. We also study the performance improvements of this approach in detail, considering NOW organizations built from uniprocessor as well as symmetric multiprocessor nodes and emerging interconnect technologies such as ATM. The results show that the significant interconnect latencies have a dramatic effect on lock acquisition overhead, and that this overhead can be drastically reduced by lock prefetching. Overall, lock prefetching is a simple and effective approach that allows more fine-grained applications to run well on emerging network-of-workstations platforms.
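As a rough illustration of the annotation-based approach the abstract describes, here is a minimal C sketch. The lock_prefetch/lock_acquire/lock_release names are hypothetical placeholders for a DSM lock interface, not the paper's actual API:

    /* Hypothetical DSM lock interface, standing in for the paper's
     * annotations; a real system would supply its own calls. */
    extern void lock_prefetch(int id);   /* non-binding hint */
    extern void lock_acquire(int id);
    extern void lock_release(int id);

    void update_shared(int lock_id, double *local, double *shared, int n)
    {
        lock_prefetch(lock_id);          /* start migrating the lock now */

        for (int i = 0; i < n; i++)      /* independent work overlaps the */
            local[i] *= 0.5;             /* in-flight lock acquisition    */

        lock_acquire(lock_id);           /* ideally completes without stalling */
        for (int i = 0; i < n; i++)
            shared[i] += local[i];
        lock_release(lock_id);
    }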

References
Conference Paper
TreadMarks is a distributed shared memory (DSM) system for standard Unix systems such as SunOS and Ultrix. This paper presents a performance evaluation of TreadMarks running on Ultrix using DECstation-5000/240's that are connected by a 100-Mbps switch-based ATM LAN and a 10-Mbps Ethernet. Our objective is to determine the efficiency of a user-level DSM implementation on commercially available workstations and operating systems. We achieved good speedups on the 8-processor ATM network for Jacobi (7.4), TSP (7.2), Quicksort (6.3), and ILINK (5.7). For a slightly modified version of Water from the SPLASH benchmark suite, we achieved only moderate speedups (4.0) due to the high communication and synchronization rate. Speedups decline on the 10-Mbps Ethernet (5.5 for Jacobi, 6.5 for TSP, 4.2 for Quicksort, 5.1 for ILINK, and 2.1 for Water), reflecting the bandwidth limitations of the Ethernet. These results support the contention that, with suitable networking technology, DSM is a viable technique for parallel computation on clusters of workstations. To achieve these speedups, TreadMarks goes to great lengths to reduce the amount of communication performed to maintain memory consistency. It uses a lazy implementation of release consistency, and it allows multiple concurrent writers to modify a page, reducing the impact of false sharing. Great care was taken to minimize communication overhead. In particular, on the ATM network, we used a standard low-level protocol, AAL3/4, bypassing the TCP/IP protocol stack. Unix communication overhead, however, remains the main obstacle in the way of better performance for programs like Water. Compared to the Unix communication overhead, memory management cost (both kernel and user level) is small and wire time is negligible.
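A minimal usage sketch of the programming model this describes. The Tmk_* names follow the TreadMarks papers, but the exact signatures here are assumptions, and the TreadMarks library is required to actually build and run it:

    #include <stdio.h>

    /* TreadMarks interface as described in its papers; treat these
     * declarations as assumptions rather than the authoritative header. */
    extern void Tmk_startup(int argc, char **argv);
    extern void Tmk_exit(int status);
    extern void Tmk_barrier(unsigned id);
    extern void Tmk_lock_acquire(unsigned id);
    extern void Tmk_lock_release(unsigned id);
    extern char *Tmk_malloc(unsigned size);
    extern unsigned Tmk_proc_id, Tmk_nprocs;

    int *sum;   /* shared counter; pointer distribution (Tmk_distribute
                   in the papers) is omitted for brevity */

    int main(int argc, char **argv)
    {
        Tmk_startup(argc, argv);         /* join processes across the cluster */
        if (Tmk_proc_id == 0)
            sum = (int *)Tmk_malloc(sizeof *sum);
        Tmk_barrier(0);

        Tmk_lock_acquire(0);             /* cluster-wide mutual exclusion */
        *sum += 1;
        Tmk_lock_release(0);

        Tmk_barrier(1);                  /* wait for all processes */
        if (Tmk_proc_id == 0)
            printf("sum = %d of %u procs\n", *sum, Tmk_nprocs);
        Tmk_exit(0);
        return 0;
    }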
Conference Paper
We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate. Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, the eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved. While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that future work on software DSMs should concentrate on reducing the amount of synchronization or its effect.
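To make the eager/lazy distinction concrete, a schematic C sketch under the simplifying assumption that modifications are summarized as per-page write notices; all types and helper names are illustrative stand-ins, not the protocols' real code:

    /* Illustrative stand-ins for the protocol machinery. */
    typedef struct { int pages[64]; int count; } write_notices;
    extern void send_write_notices(int proc, const write_notices *wn);
    extern void request_write_notices(int proc, int lock_id, write_notices *out);
    extern void invalidate_pages(const write_notices *wn);

    /* Eager release: push notices to every other processor at each
     * release, whether or not they will ever touch the data. */
    void eager_release(const write_notices *mods, int nprocs, int self)
    {
        for (int p = 0; p < nprocs; p++)
            if (p != self)
                send_write_notices(p, mods);   /* O(P) messages, always */
    }

    /* Lazy acquire: nothing is sent at release; the next acquirer pulls
     * only the notices it is missing, so traffic follows the lock. */
    void lazy_acquire(int lock_id, int last_releaser)
    {
        write_notices missing;
        request_write_notices(last_releaser, lock_id, &missing);
        invalidate_pages(&missing);            /* messages only on this path */
    }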
Conference Paper
Compares the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. For up to eight processors, the results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Beyond eight processors, the results are based on execution-driven simulation. Specifically, the authors compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node.
Conference Paper
The approach of program-driven simulation of multiprocessors has generally been believed to be too slow to permit experiments and performance evaluations with realistic workloads. We show that the program-driven approach to building multiprocessor simulators is indeed a viable method: it compares well in performance to an execution-driven simulator reported in the literature, and it has superior flexibility. The reported simulator is the core of the CacheMire test bench, an entire environment for conducting performance evaluations on shared-memory multiprocessors. The test bench is used in a number of projects, including cache coherence protocol evaluation, super-pipelined processor design, and analysis of parallel program behaviour.
Conference Paper
Introduces optimistic lock synchronization using the group write consistency (GWC) model. GWC guarantees strict ordering of all shared writes in a processor group. In optimistic synchronization, if a lock-requesting processor can assume that the lock is free, execution of mutually exclusive code starts immediately; a wrong assumption results in rollback. Shared-variable updates remain in the group until the lock manager grants the lock to the requesting processor. By evaluating the time needed for three processors to execute mutually exclusive code, the authors show that GWC can outperform weak, release, and even entry consistency. Simulations of task management using exclusive access to a shared queue also show much faster mutual exclusion with GWC. Optimistic mutual exclusion may further halve total delays in accessing shared resources.
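A hedged sketch of the optimistic acquire/rollback control flow just described. Every helper here is an illustrative stand-in; under GWC the buffered updates actually remain within the processor group rather than in a local log:

    /* Illustrative stand-ins for the lock manager and speculation support. */
    extern void request_lock(int id);        /* ask the manager, don't wait */
    extern int  lock_granted(int id);
    extern void wait_for_lock(int id);
    extern void release_lock(int id);
    extern void begin_speculation(void);     /* start buffering shared writes */
    extern void commit_speculation(void);    /* publish buffered writes */
    extern void abort_speculation(void);     /* discard buffered writes */

    void optimistic_critical_section(int lock_id, void (*body)(void))
    {
        request_lock(lock_id);           /* non-blocking lock request */
        begin_speculation();
        body();                          /* run assuming the lock was free */

        if (lock_granted(lock_id)) {
            commit_speculation();        /* guess was right: make visible */
        } else {
            abort_speculation();         /* guess was wrong: roll back,   */
            wait_for_lock(lock_id);      /* then re-execute conventionally */
            body();
        }
        release_lock(lock_id);
    }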
Article
The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.
Article
Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important to reduce the number of messages and the amount of data exchanged for remote memory access. Lazy release consistency is a new algorithm for implementing release consistency that lazily pulls modifications across the interconnect only when necessary. Trace-driven simulation using the SPLASH benchmarks indicates that lazy release consistency reduces both the number of messages and the amount of data transferred between processors. These reductions are especially significant for programs that exhibit false sharing and make extensive use of locks.
Article
The shared-memory data-parallel model presents an attractive interface for programming multiprocessors by allowing for easy management of parallel tasks while hiding details of the underlying machine architecture. Unfortunately, the shared-memory abstraction requires synchronization in order to maintain data consistency. Present compilers provide consistency between parallel code sections by enforcing a global point of synchrony with barrier synchronization. Such a simple mechanism has several disadvantages. First, the required global collection of information generates significant overhead, which leads machine designers to employ special hardware to support barriers. Second, global synchronization reduces parallelism by requiring needless serialization of independent tasks. This work aims to reduce the costs associated with these disadvantages by generating pairwise point-to-point synchronization between specific tasks.
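The generated point-to-point synchronization can be pictured as a post/wait pair on a single dependence edge: only the consumer that needs a producer's result waits for it, instead of every task stalling at a global barrier. A minimal pthreads sketch of the idea, not the compiler's actual output:

    /* Point-to-point post/wait replacing a global barrier: one producer
     * signals one dependence edge, and only its consumer blocks. */
    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int posted = 0;

    void post(void)                   /* producer: result is ready */
    {
        pthread_mutex_lock(&m);
        posted = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }

    void wait_for_post(void)          /* consumer: stall only on its producer */
    {
        pthread_mutex_lock(&m);
        while (!posted)
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }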
Article
We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. Our goal is to provide a suite of realistic applications that will serve as a well-documented and consistent basis for evaluation studies. We describe the applications currently in the suite in detail, discuss some of their important characteristics, and explore their behavior by running them on a real multiprocessor as well as on a simulator of an idealized parallel architecture. We expect the current set of applications to act as a nucleus for a suite that will grow with time.
Article
The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.
Conference Paper
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. Firstly, the execution of the barrier may be slow, as it may not only require the execution of several instructions but also result in hot-spot accesses. Secondly, processors that are stalled waiting for other processors to reach the barrier are essentially idling and cannot do any useful work. In this paper, the notion of the fuzzy barrier is presented, which avoids the above drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of several instructions such that a processor is ready to synchronize upon reaching the first instruction in this region and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and that the use of program transformations can significantly increase their size. Examples of situations where such a mechanism can result in improved performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.
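A single-episode sketch of the fuzzy-barrier idea, with a shared counter standing in for the paper's hardware mechanism: a process announces arrival when it enters the barrier region, keeps executing the region's statements, and blocks at the exit only if stragglers remain. Purely illustrative:

    /* Fuzzy barrier (one episode): announce arrival at region entry,
     * overlap region work with the wait, spin only at region exit. */
    #include <stdatomic.h>

    #define NPROCS 8                   /* assumed number of participants */
    static atomic_int arrived;         /* how many have entered the region */

    void fuzzy_barrier_region(void (*region_work)(void))
    {
        atomic_fetch_add(&arrived, 1);         /* "ready to synchronize" */
        region_work();                         /* useful work hides the wait */
        while (atomic_load(&arrived) < NPROCS)
            ;                                  /* block only if others lag */
    }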
Article
A performance evaluation of the Symmetry multiprocessor system revealed that the synchronization mechanism did not perform well for highly contested locks, like those found in certain parallel applications. Several software synchronization mechanisms designed to reduce contention for the lock were developed and evaluated on the Symmetry multiprocessor system using a hardware monitor. These mechanisms remain valuable even when changes are made to the hardware synchronization mechanism to improve support for highly contested locks. The Symmetry architecture is described, and a number of lock algorithms and their use of hardware resources are examined. The performance of each lock is observed from the perspective of both the program itself and the total system performance.
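One classic software remedy of the kind such studies evaluate is test-and-set with exponential backoff, which steps off the bus after each failed attempt instead of hammering the lock. A generic C11 sketch, not the Symmetry study's exact code:

    /* Test-and-set lock with exponential backoff to cut bus contention. */
    #include <stdatomic.h>
    #include <unistd.h>

    static atomic_flag lock_word = ATOMIC_FLAG_INIT;

    void backoff_lock(void)
    {
        unsigned delay_us = 1;
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire)) {
            usleep(delay_us);          /* back off after a failed attempt */
            if (delay_us < 1024)
                delay_us *= 2;         /* exponential growth, capped */
        }
    }

    void backoff_unlock(void)
    {
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
    }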
Article
We consider a network of workstations (NOW) organization consisting of bus-based multiprocessors interconnected by an ATM interconnect on which a shared-memory programming model is imposed by using a multiple-writer distributed virtual shared memory system. The latencies associated with bringing data into the local memory are a severe performance limitation of such systems. To tolerate the access latencies, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol. This approach uses the access history of each page to guide which pages to prefetch. Based on detailed architectural simulations and seven scientific applications, we find that our prefetch algorithm can remove a vast majority of the remote operations, which improves the performance of all applications. We also find that the bandwidth provided by ATM switches available today is sufficient to accommodate prefetching. However, the protocol processing overhead of available ATM interfaces limits the gain of the prefetching algorithms.
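A sketch of what history-guided prefetching might look like in the remote-fault path: on a demand fetch, also request the pages that previously followed the faulting page. The two-successor history table and all helpers are illustrative stand-ins for the protocol described above:

    /* History-guided page prefetch hooked into the DSM fault handler. */
    #define NPAGES        65536
    #define HISTORY_DEPTH 2

    extern void fetch_page(int page);      /* demand fetch (blocking)  */
    extern void prefetch_page(int page);   /* asynchronous prefetch    */
    extern int  page_is_local(int page);

    static int successors[NPAGES][HISTORY_DEPTH];  /* learned access order */

    void on_remote_page_fault(int page)
    {
        fetch_page(page);                  /* the fault itself */
        for (int i = 0; i < HISTORY_DEPTH; i++) {
            int next = successors[page][i];
            if (next > 0 && !page_is_local(next))
                prefetch_page(next);       /* overlap with upcoming accesses */
        }
    }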
Conference Paper
We consider a network of workstations (NOW) organization consisting of a number of bus-based multiprocessor servers interconnected by an ATM switch. A shared-memory model is supported by distributed virtual shared memory (DVSM), and this paper focuses on the access penalties incurred by (1) ATM and (2) the DVSM software. First, through detailed architectural simulations we find that while the bandwidth and latency of the ATM switch fabrics are acceptable, the latency incurred by commercially available ATM interfaces has a first-order effect on performance. We also study the effects of various scheduling policies for the coherence handlers. Our data suggest that since the probability of finding an idle processor within a cluster is high, a good policy is to schedule coherence handlers on such a processor instead of letting an extra compute processor execute them. Overall, by adjusting the adaptation layer of ATM to a DVSM system, we find that ATM is a promising technology for these kinds of systems.
Conference Paper
The network interfaces of existing multicomputers require a significant amount of software overhead to provide protection and to implement message passing protocols. The authors describe the design of a low-latency, high-bandwidth, virtual memory-mapped network interface for the SHRIMP multicomputer project at Princeton University. Without sacrificing protection, the network interface achieves low latency by using virtual memory mapping and write-latency hiding techniques, and obtains high bandwidth by providing a user-level block data transfer mechanism. The authors have implemented several message passing primitives in an experimental environment, demonstrating that their approach can reduce the message passing overhead to a few user-level instructions.
Conference Paper
Future parallel computers must efficiently execute not only hand-coded applications but also programs written in high-level, parallel programming languages. Today's machines limit these programs to a single communication paradigm, either message-passing or shared-memory, which results in uneven performance. The authors address this problem by defining an interface, Tempest, that exposes low-level communication and memory-system mechanisms so programmers and compilers can customize policies for a given application. Typhoon is a proposed hardware platform that implements these mechanisms with a fully-programmable, user-level processor in the network interface. The authors demonstrate the utility of Tempest with two examples. First, the Stache protocol uses Tempest's fine-grain access control mechanisms to manage part of a processor's local memory as a large, fully-associative cache for remote data. The authors simulated Typhoon on the Wisconsin Wind Tunnel and found that Stache running on Typhoon performs comparably (±30%) to an all-hardware Dir_N NB cache-coherence protocol for five shared-memory programs. Second, they illustrate how programmers or compilers can use Tempest's flexibility to exploit an application's sharing patterns with a custom protocol. For the EM3D application, the custom protocol improves performance by up to 35% over the all-hardware protocol.
Article
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
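The degree-adaptation loop can be sketched as follows; the thresholds, maximum degree, and sampling interval are illustrative, not the paper's values:

    /* Adaptive sequential prefetching in miniature: prefetch `degree`
     * consecutive blocks on each miss, and periodically adjust `degree`
     * from a running measure of how many prefetches were actually used. */
    extern void prefetch_block(long block);   /* illustrative stand-in */

    static int degree = 1;          /* blocks prefetched per miss */
    static int issued, useful;      /* effectiveness counters */

    void on_cache_miss(long block)
    {
        for (int k = 1; k <= degree; k++) {
            prefetch_block(block + k);   /* consecutive blocks: spatial locality */
            issued++;
        }
    }

    void on_prefetch_hit(void) { useful++; }  /* a prefetched block was used */

    void adapt_degree(void)         /* run every N misses */
    {
        if (issued == 0) { degree = 1; return; }
        double ratio = (double)useful / issued;
        if (ratio > 0.75 && degree < 8) degree++;   /* mostly useful: go deeper */
        if (ratio < 0.25 && degree > 0) degree--;   /* mostly wasted: back off  */
        issued = useful = 0;
    }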
Article
Networks of workstations are poised to become the primary computing infrastructure for science and engineering. NOWs may dramatically improve virtual memory and file system performance; achieve cheap, highly available, and scalable file storage; and provide multiple CPUs for parallel computing. Hurdles that remain include efficient communication hardware and software, global coordination of multiple workstation operating systems, and enterprise-scale network file systems. Our 100-node NOW prototype aims to demonstrate practical solutions to these challenges.
Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A., "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 24-36.