Figure 2. Comparison of data migration costs and computation migration costs for MCRL on Alewife.

Source publication
Conference Paper
Dynamic computation migration is the runtime choice between computation and data migration. Dynamic computation migration speeds up access to concurrent data structures with unpredictable read/write patterns. This paper describes the design, implementation, and evaluation of dynamic computation migration in a multithreaded distributed shared-memory...
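The runtime choice the abstract describes can be pictured with a small sketch. The C fragment below is illustrative only: the names, the writer_count statistic, and the write-shared heuristic are my assumptions, not MCRL's actual protocol. It shows the shape of the decision: on a remote write, ship the operation to the region's home node when the region is write-shared (so its data would otherwise bounce between caches), and fall back to ordinary data migration otherwise.

```c
/* Minimal sketch of an MCRL-style dynamic choice between data and
 * computation migration.  All names and the heuristic are illustrative. */
#include <stdio.h>

typedef struct {
    int home_node;      /* node that owns the region */
    int writer_count;   /* observed concurrent writers (hypothetical stat) */
} region_t;

static int local_node = 0;

/* Placeholder transports: fetch the region's data locally, or ship the
 * operation to the region's home node. */
static void migrate_data(region_t *r)        { printf("fetch region from node %d\n", r->home_node); }
static void migrate_computation(region_t *r) { printf("ship operation to node %d\n", r->home_node); }

/* On a remote write, choose computation migration when the region is
 * write-shared (caching it locally would just bounce it between nodes);
 * otherwise fall back to ordinary data migration. */
void start_write(region_t *r) {
    if (r->home_node == local_node)
        return;                      /* already local: no migration needed */
    if (r->writer_count > 1)
        migrate_computation(r);
    else
        migrate_data(r);
}

int main(void) {
    region_t shared  = { .home_node = 1, .writer_count = 3 };
    region_t private = { .home_node = 2, .writer_count = 1 };
    start_write(&shared);   /* write-shared: ship the computation */
    start_write(&private);  /* mostly-private: fetch the data     */
    return 0;
}
```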

Context in source publication

Context 1
... is important to note that the CM-5 performance is only achievable when polling is used. Figure 2 compares the cost to touch a region using pure data migration and dynamically choosing computation migration. The cost to choose computation migration is not measured in these experiments. ...

Similar publications

Article
Vector coprocessors (VPs), which are commonly assigned exclusively to a single thread/core, are often neither performance- nor energy-efficient due to mismatches with the vector needs of individual applications. We present in this article an easy-to-implement VP virtualization technique that, when applied, enables a multithreaded VP to simultaneously execut...
Conference Paper
In this paper, we present a modeling approach for the management of highly interactive, multithreaded and multimodal dialogues. Our approach enforces the separation of dialogue content and dialogue structure and is based on a statechart language enfolding concepts for hierarchy, concurrency, variable scoping and a detailed runtime history. These co...
Article
The notion that certain procedures are atomic is a fundamental correctness property of many multithreaded software systems. A procedure is atomic if for every execution there is an equivalent serial execution in which the actions performed by any thread while executing the atomic procedure are not interleaved with actions of other threads. Several...
Article
Nowadays, advanced smart mobile devices are equipped with multi‐core central processing units for handling multithreaded (MT) applications. However, existing research mainly uses single‐thread (ST) computing to deal with applications, which limits the performance of mobile computing. To make full use of multi‐core resources, this study proposes a fine‐...
Article
Sorting the suffixes of an input string is a fundamental task in many applications such as data compression, genome alignment, and full-text search. The induced sorting (IS) method has been successfully applied to design a number of state-of-the-art suffix sorting algorithms. In particular, the SAIS algorithm designed by the IS method is not only l...

Citations

... CPHASH attempts to execute computation close to data so that the coherence protocol doesn't have to move the data, and was inspired by computation migration in distributed shared-memory systems such as MCRL [14] and Olden [8], remote method invocation in parallel programming languages such as Cool [10] and Orca [2], and the O2 scheduler [6]. CPHASH isn't as general as these computation-migration systems, but applies the idea to a single data structure that is widely used in server applications and investigates whether this approach works for multicore processors. ...
Conference Paper
CPHash is a concurrent hash table for multicore processors. CPHash partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHash's message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple messages into a single cache line transfer. Experiments on an 80-core machine with 2 hardware threads per core show that CPHash has ~1.6x higher throughput than a hash table implemented using fine-grained locks. An analysis shows that CPHash wins because it experiences fewer cache misses and its cache misses are less expensive, owing to less contention for the on-chip interconnect and DRAM. CPServer, a key/value cache server using CPHash, achieves ~5% higher throughput than a key/value cache server that uses a hash table with fine-grained locks, but both achieve better throughput and scalability than memcached. The throughput of CPHash and CPServer also scales near-linearly with the number of cores.
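As a rough illustration of this design, the single-threaded C mock below (hypothetical names; the real system runs a dedicated server thread per partition and batches asynchronous messages) shows the core idea: hash to a partition and hand the operation to that partition's owner instead of locking shared buckets.

```c
/* Sketch of CPHash's core idea: partition the table and ship each
 * lookup/insert, as a message, to the core that owns the partition.
 * This mock elides the real system's server threads, message
 * batching, and LRU eviction. */
#include <stdio.h>

#define NPART 4
#define BUCKETS_PER_PART 256

struct bucket { long key, val; };
static struct bucket table[NPART][BUCKETS_PER_PART];

static int partition_of(long key) { return (key >> 8) % NPART; }

/* In CPHash this runs on the partition's dedicated server core, so
 * only that core ever touches the partition's cache lines. */
static void serve_insert(int part, long key, long val) {
    struct bucket *b = &table[part][key % BUCKETS_PER_PART];
    b->key = key; b->val = val;
}

/* Client side: instead of locking shared buckets, "send" the
 * operation to the owning partition (here, a direct call stands in
 * for the message send). */
void insert(long key, long val) {
    int part = partition_of(key);
    serve_insert(part, key, val);
}

int main(void) {
    insert(0x1234, 42);
    printf("partition %d holds key 0x1234\n", partition_of(0x1234));
    return 0;
}
```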
... CPHASH attempts to move computation close to data, and was inspired by computation migration in distributed shared memory systems such as MCRL [12] and Olden [4] and remote method invocation in parallel programming languages such as Cool [6] and Orca [1]. ...
Article
In this thesis we introduce CPHASH, a scalable fixed-size hash table that supports eviction using an LRU list, and CPSERVER, a scalable in-memory key/value cache server that uses CPHASH to implement its hash table. CPHASH uses computation migration to avoid transferring data between cores. Experiments on a 48-core machine show that CPHASH has 2 to 3 times higher throughput than a hash table implemented using scalable fine-grained locks. CPSERVER achieves 1.2 to 1.7 times higher throughput than a key/value cache server that uses a hash table with scalable fine-grained locks and 1.5 to 2.6 times higher throughput than MEMCACHED.
... Bellosa and Steckermeier [3] use cache-miss counters to do better thread scheduling. CoreTime dynamically decides to migrate an operation to a different core, which is related to computation migration in distributed shared-memory systems (e.g., [4, 10]) and object-based parallel programming language systems for NUMA systems (e.g., [2, 5, 9]). These systems use simple heuristics that do not take advantage of hardware counters, and do not consider the technique as part of a general scheduling problem. ...
Conference Paper
High performance on multicore processors requires that schedulers be reinvented. Traditional schedulers focus on keeping execution units busy by assigning each core a thread to run. Schedulers ought to focus, however, on high utilization of on-chip memory, rather than of execution cores, to reduce the impact of expensive DRAM and remote cache accesses. A challenge in achieving good use of on-chip memory is that the memory is split up among the cores in the form of many small caches. This paper argues for a form of scheduling that assigns each object and its operations to a specific core, moving a thread among the cores as it uses different objects.
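A minimal sketch of this object-affinity idea, assuming Linux and using sched_setaffinity as a crude stand-in for a real scheduler (a real implementation would not re-pin the thread on every call): move the calling thread to the object's home core before running the operation, so the object's cache lines stay resident in that core's cache.

```c
/* Sketch of object-affinity scheduling: pin each object to a core and
 * run the object's operations there, migrating the thread rather than
 * the object's cache lines.  Names and mechanism are illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

typedef struct {
    int home_core;   /* core whose cache holds this object */
    long state;
} object_t;

/* Move the calling thread onto the object's home core, then run the
 * operation while the object's lines stay in that cache. */
void invoke_on_home(object_t *o, void (*op)(object_t *)) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(o->home_core, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* migrate the thread */
    op(o);
}

static void bump(object_t *o) { o->state++; }

int main(void) {
    object_t counter = { .home_core = 0, .state = 0 };
    invoke_on_home(&counter, bump);
    printf("state = %ld\n", counter.state);
    return 0;
}
```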
... One is to explicitly differentiate between local and remote pointers (by using a different programming notation for each type). The main drawback of this approach, taken in several existing systems, including Cid [7], Earth [3], and Olden [10][4], is that the original program must be significantly modified (manually or using compiler analysis [6][10]) such that it always uses the correct reference—remote or local—based on the given mapping of the data structure. It also prevents the mapping of the data structure from being decided at runtime, which is a major limitation for dynamic pointer-based data structures. ...
Conference Paper
We present an approach to implementing and using global pointers in a distributed computing environment. The programmer is able to create pointer-based distributed data structures, which can then be used by sequential or parallel programs without having to differentiate between local and global pointers. Any reference to a remote address causes the process to either migrate to the remote host, where it continues its execution, or to perform a remote access operation. The decision is made automatically and fully transparently to the programmer. By using a hardware-supported memory checking mechanism, we avoid any overhead associated with the detection of remote references.
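The hardware-supported detection this abstract mentions can be approximated in user space. The Linux-only C sketch below is an illustration, not the paper's mechanism: it assumes a 4 KiB page, and printf/mprotect inside a signal handler are tolerated purely for demonstration. A "remote" region is mapped PROT_NONE so any dereference faults, and the fault handler stands in for the migrate-or-fetch decision.

```c
/* Sketch of trap-based remote-reference detection, assuming Linux.
 * A "remote" region is mapped PROT_NONE so any dereference faults;
 * the handler stands in for the migrate-or-fetch decision. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096   /* assumed page size */

static char *remote_region;

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    printf("remote reference at %p: migrate or fetch\n", si->si_addr);
    /* Make the page accessible so the faulting access can be retried;
     * a real system would fill it with the remote data first. */
    mprotect(remote_region, PAGE, PROT_READ | PROT_WRITE);
}

int main(void) {
    remote_region = mmap(NULL, PAGE, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    remote_region[0] = 'x';   /* faults; handler "fetches" the page */
    printf("after fault: %c\n", remote_region[0]);
    return 0;
}
```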
... The Linux kernel could use a variant of address ranges internally, and perhaps allow application control via mmap flags. Applications could use kernel-core-like mechanisms to reduce system call invocation costs, to avoid concurrent access to kernel data, or to manage an application's sharing of its own data using techniques like computation migration [6,14]. Finally, it may be possible for Linux to provide share-like control over the sharing of file descriptors, entries in buffer and directory lookup caches, and network stacks. ...
Conference Paper
Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application's performance. This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions. Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25% faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines.
... Upon start-read/write(X), the region X is mapped locally. MCRL [3] extends CRL with thread migration. A start-write(X) now causes migration of the current activation record to the machine that hosts X. ...
Article
A DSM protocol ensures that a thread can access data allocated on another machine using some consistency protocol. The consistency protocol can either replicate the data and unify replica changes periodically, or the thread, upon remote access, can migrate to the machine that hosts the data and access the data there. There is a performance trade-off between these extremes. Data replication suffers from a high memory overhead, as every replicated object or page consumes memory on each machine. On the other hand, it is as bad to migrate threads upon each remote access, since repeated accesses to the same distributed data set will cause repeated network communication, whereas replication incurs this cost only once (at the cost of increased administration overhead to manage the replicas). We propose a hybrid protocol that uses selective replication with thread migration as its default. Even in the presence of extreme memory pressure and thread migrations, our protocol reaches or exceeds the performance that can be achieved by means of manual replication and explicit changes to the application's code.
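The default-migrate, selectively-replicate policy might look like the C sketch below. The per-object counter and the threshold are my illustrative stand-ins for whatever the paper actually uses to detect repeated remote access.

```c
/* Minimal sketch of a hybrid DSM policy: migrate the thread by
 * default, switch to replication once an object has been reached
 * remotely often enough to amortize a replica's memory cost. */
#include <stdio.h>

#define REPLICATE_AFTER 3   /* illustrative threshold */

typedef struct {
    int home_node;
    int remote_hits;    /* remote accesses seen so far */
    int replicated;     /* does a local replica exist? */
} dsm_object_t;

static void migrate_thread_to(int node)   { printf("migrate thread to node %d\n", node); }
static void make_replica(dsm_object_t *o) { o->replicated = 1; printf("replicate object locally\n"); }

void access_remote(dsm_object_t *o) {
    if (o->replicated) return;              /* served from local replica */
    if (++o->remote_hits >= REPLICATE_AFTER)
        make_replica(o);                    /* hot object: replicate */
    else
        migrate_thread_to(o->home_node);    /* default: go to the data */
}

int main(void) {
    dsm_object_t o = { .home_node = 1 };
    for (int i = 0; i < 5; i++)
        access_remote(&o);
    return 0;
}
```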
... Dynamic computation migration is useful for concurrent data structures with unpredictable read/write patterns [Hsieh et al., 1996]. Simultaneously considering thread memory access types and global sharing could reduce the communication during load balancing by half [Liang et al., 2002]. ...
... A number of algorithms have been proposed in the context of process migration [Milojicić et al., 2000]. A study of computation migration in which the choice is made at runtime shows that dynamic migration is useful for concurrent data structures with unpredictable read/write patterns [Hsieh et al., 1996]. Another study has pointed out that when threads are migrated from heavily loaded nodes to lightly loaded nodes for load balancing in software DSM systems, the communication cost of maintaining data consistency increases if the migrated threads are carelessly selected [Liang et al., 2002]. ...
Thesis
Distributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to their programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments, resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model designs in two respects: the limitation of the simulation framework on model extensibility, and the lack of verification applicability during a simulation run, which delays the verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility of maximising the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similar cache-miss results, with an average difference of ±6% and a worst case of less than 15%. These results confirm that the techniques used in developing DSiMCLUSTER can contribute to achieving both (a) a highly extensible simulation framework that keeps up with the ongoing innovation of DSM architectures, and (b) verification applicability, resulting in an efficient framework for memory analysis experiments on DSM architectures.
... Whereas D-CVM uses a correlation map to decide which threads to migrate to the same node of the cluster, MILLIPEDE follows a history-based approach by maintaining a history of pages accessed by local threads. MCRL [10] performs dynamic computation migration through migrating only parts of the thread's current activation records to remote nodes. Mosix [11] enhances the OS kernel to support process migration in a cluster environment. ...
Article
We present a compiler-based technique to automatically identify and extract Remote Procedure Calls, so-called Function Splices, out of potentially arbitrary sequences of Java code compiled for a software DSM. The goal is to lower communication latencies and message traffic by replacing data shipping with function shipping. Dynamic Function Splicing dynamically decides at runtime whether to invoke a function splice on the local machine or to execute it remotely on the home node of the requested data. On proof-of-concept micro-benchmarks, Dynamic Function Splicing reduces the execution wall time by approximately 29%; about 25% of the messages can be saved.
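To make that runtime decision concrete, here is a hypothetical version of the wrapper such a compiler could emit around an extracted splice. The size-based heuristic and all names are assumptions for illustration, not the paper's mechanism, and the sketch is in C rather than the paper's Java setting.

```c
/* Sketch of an emitted splice wrapper: at the call site, decide at
 * runtime whether to run the extracted splice locally or ship it to
 * the home node of its data. */
#include <stdio.h>

static int local_node = 0;

/* The extracted splice: operates on one remote object. */
static long sum_list(const long *data, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += data[i];
    return s;
}

/* Stand-in for the DSM runtime's function-shipping path. */
static long ship_to_home(int node, const long *data, int n) {
    printf("execute splice on home node %d\n", node);
    return sum_list(data, n);   /* would run remotely in the real system */
}

/* Emitted wrapper: ship the splice when the data lives elsewhere and
 * is large enough that fetching it would cost more than one RPC. */
long sum_list_spliced(int home_node, const long *data, int n) {
    if (home_node != local_node && n > 16)
        return ship_to_home(home_node, data, n);
    return sum_list(data, n);   /* local execution: data is cheap to fetch */
}

int main(void) {
    long v[32] = {0};
    v[0] = 7;
    printf("sum = %ld\n", sum_list_spliced(1, v, 32));
    return 0;
}
```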
... The information supporting thread selection consists of remote-reference histories collected by a statistical component in the underlying DSM system. MCRL [5] is a multithreaded distributed shared-memory system that implements a dynamic choice between data and computation migration. MCRL data objects (programmer-defined regions) are managed using a home-based sequentially consistent protocol. ...
Article
Shared memory programs running on Non-Uniform Memory Access (NUMA) machines usually face inherent performance problems stemming from excessive remote memory accesses. A solution, called the Adaptive Runtime System (ARS), is presented in this paper. ARS is designed to adjust the data distribution at runtime through automatic page migrations. It uses memory access histograms gathered by hardware monitors to find access hot spots and, based on this detection, to dynamically and transparently modify the data layout. In this way, incorrectly allocated data can be moved to the most appropriate node and hence data locality can be improved. Simulations show that this achieves performance gains of up to 40%.
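Histogram-driven placement of this kind reduces to a simple rule, sketched below in C with illustrative structures and an illustrative threshold (ARS reads the per-node counts from hardware monitors): move a page to the node that accesses it most, but only when that node clearly dominates, to avoid ping-ponging pages between nodes.

```c
/* Sketch of histogram-driven page placement: given per-node access
 * counts for a page, move it to the node that touches it most. */
#include <stdio.h>

#define NNODES 4

typedef struct {
    int owner;                /* node currently holding the page */
    long accesses[NNODES];    /* histogram from the hardware monitor */
} page_t;

static void migrate_page(page_t *p, int node) {
    printf("migrate page: node %d -> node %d\n", p->owner, node);
    p->owner = node;
}

/* Migrate only when some other node clearly dominates (here, by a
 * 2x margin), so pages do not bounce between near-equal users. */
void rebalance(page_t *p) {
    int hottest = 0;
    for (int n = 1; n < NNODES; n++)
        if (p->accesses[n] > p->accesses[hottest]) hottest = n;
    if (hottest != p->owner &&
        p->accesses[hottest] > 2 * p->accesses[p->owner])
        migrate_page(p, hottest);
}

int main(void) {
    page_t p = { .owner = 0, .accesses = {10, 250, 3, 8} };
    rebalance(&p);   /* node 1 dominates: page moves there */
    return 0;
}
```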
... Regions are locally cached until another machine requires the same object, with some lazy flushing performed at each end-read/write. MCRL [11] is an object-based system derived from CRL that implements computation migration. Write operations are shipped to the region's creating machine, while read operations are performed locally. ...
Article
Jackal is a fine-grained distributed shared memory implementation of the Java programming language. Jackal implements Java's memory model and allows multithreaded Java programs to run unmodified on distributed-memory systems.