Figure 2. Comparison of data migration costs and computation migration costs for MCRL on Alewife.

Source publication
Conference Paper
Dynamic computation migration is the runtime choice between computation and data migration. Dynamic computation migration speeds up access to concurrent data structures with unpredictable read/write patterns. This paper describes the design, implementation, and evaluation of dynamic computation migration in a multithreaded distributed shared-memory...
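The runtime choice the abstract describes can be pictured with a small sketch. The C fragment below is illustrative only: the names, the writer_count statistic, and the write-shared heuristic are my assumptions, not MCRL's actual protocol. It shows the shape of the decision: on a remote write, ship the operation to the region's home node when the region is write-shared (so its data would otherwise bounce between caches), and fall back to ordinary data migration otherwise.

```c
/* Minimal sketch of an MCRL-style dynamic choice between data and
 * computation migration.  All names and the heuristic are illustrative. */
#include <stdio.h>

typedef struct {
    int home_node;      /* node that owns the region */
    int writer_count;   /* observed concurrent writers (hypothetical stat) */
} region_t;

static int local_node = 0;

/* Placeholder transports: fetch the region's data locally, or ship the
 * operation to the region's home node. */
static void migrate_data(region_t *r)        { printf("fetch region from node %d\n", r->home_node); }
static void migrate_computation(region_t *r) { printf("ship operation to node %d\n", r->home_node); }

/* On a remote write, choose computation migration when the region is
 * write-shared (caching it locally would just bounce it between nodes);
 * otherwise fall back to ordinary data migration. */
void start_write(region_t *r) {
    if (r->home_node == local_node)
        return;                      /* already local: no migration needed */
    if (r->writer_count > 1)
        migrate_computation(r);
    else
        migrate_data(r);
}

int main(void) {
    region_t shared  = { .home_node = 1, .writer_count = 3 };
    region_t private = { .home_node = 2, .writer_count = 1 };
    start_write(&shared);   /* write-shared: ship the computation */
    start_write(&private);  /* mostly-private: fetch the data     */
    return 0;
}
```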

Context in source publication

Context 1
... is important to note that the CM-5 performance is only achievable when polling is used. Figure 2 compares the cost to touch a region using pure data migration and dynamically choosing computation migration. The cost to choose computation migration is not measured in these experiments. ...

Similar publications

Article
Vector coprocessors (VPs), which are commonly assigned exclusively to a single thread/core, are often neither performance- nor energy-efficient due to mismatches with the vector needs of individual applications. We present in this article an easy-to-implement VP virtualization technique that, when applied, enables a multithreaded VP to simultaneously execut...
Conference Paper
In this paper, we present a modeling approach for the management of highly interactive, multithreaded and multimodal dialogues. Our approach enforces the separation of dialogue content and dialogue structure and is based on a statechart language enfolding concepts for hierarchy, concurrency, variable scoping and a detailed runtime history. These co...
Article
The notion that certain procedures are atomic is a fundamental correctness property of many multithreaded software systems. A procedure is atomic if for every execution there is an equivalent serial execution in which the actions performed by any thread while executing the atomic procedure are not interleaved with actions of other threads. Several...
Article
Nowadays, advanced smart mobile devices are equipped with multi‐core central processing units for handling multithreaded (MT) applications. However, existing research mainly uses single‐thread (ST) computing to deal with applications, which limits the performance of mobile computing. To make full use of multi‐core resources, this study proposes a fine‐...
Article
Sorting the suffixes of an input string is a fundamental task in many applications such as data compression, genome alignment, and full-text search. The induced sorting (IS) method has been successfully applied to design a number of state-of-the-art suffix sorting algorithms. In particular, the SAIS algorithm designed by the IS method is not only l...

Citations

... CPHASH attempts to execute computation close to data so that the coherence protocol doesn't have to move the data, and was inspired by computation migration in distributed shared-memory systems such as MCRL [14] and Olden [8], remote method invocation in parallel programming languages such as Cool [10] and Orca [2], and the O2 scheduler [6]. CPHASH isn't as general as these computation-migration systems, but applies the idea to a single data structure that is widely used in server applications and investigates whether this approach works for multicore processors. ...
Conference Paper
CPHash is a concurrent hash table for multicore processors. CPHash partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHash's message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple messages into a single cache line transfer. Experiments on an 80-core machine with 2 hardware threads per core show that CPHash has ~1.6x higher throughput than a hash table implemented using fine-grained locks. An analysis shows that CPHash wins because it experiences fewer cache misses and its cache misses are less expensive, owing to less contention for the on-chip interconnect and DRAM. CPServer, a key/value cache server using CPHash, achieves ~5% higher throughput than a key/value cache server that uses a hash table with fine-grained locks, but both achieve better throughput and scalability than memcached. The throughput of CPHash and CPServer also scales near-linearly with the number of cores.
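As a rough illustration of this design, the single-threaded C mock below (hypothetical names; the real system runs a dedicated server thread per partition and batches asynchronous messages) shows the core idea: hash to a partition and hand the operation to that partition's owner instead of locking shared buckets.

```c
/* Sketch of CPHash's core idea: partition the table and ship each
 * lookup/insert, as a message, to the core that owns the partition.
 * This mock elides the real system's server threads, message
 * batching, and LRU eviction. */
#include <stdio.h>

#define NPART 4
#define BUCKETS_PER_PART 256

struct bucket { long key, val; };
static struct bucket table[NPART][BUCKETS_PER_PART];

static int partition_of(long key) { return (key >> 8) % NPART; }

/* In CPHash this runs on the partition's dedicated server core, so
 * only that core ever touches the partition's cache lines. */
static void serve_insert(int part, long key, long val) {
    struct bucket *b = &table[part][key % BUCKETS_PER_PART];
    b->key = key; b->val = val;
}

/* Client side: instead of locking shared buckets, "send" the
 * operation to the owning partition (here, a direct call stands in
 * for the message send). */
void insert(long key, long val) {
    int part = partition_of(key);
    serve_insert(part, key, val);
}

int main(void) {
    insert(0x1234, 42);
    printf("partition %d holds key 0x1234\n", partition_of(0x1234));
    return 0;
}
```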
... CPHASH attempts to move computation close to data, and was inspired by computation migration in distributed shared memory systems such as MCRL [12] and Olden [4] and remote method invocation in parallel programming languages such as Cool [6] and Orca [1]. ...
Article
In this thesis we introduce CPHASH, a scalable fixed-size hash table that supports eviction using an LRU list, and CPSERVER, a scalable in-memory key/value cache server that uses CPHASH to implement its hash table. CPHASH uses computation migration to avoid transferring data between cores. Experiments on a 48-core machine show that CPHASH has 2 to 3 times higher throughput than a hash table implemented using scalable fine-grained locks. CPSERVER achieves 1.2 to 1.7 times higher throughput than a key/value cache server that uses a hash table with scalable fine-grained locks and 1.5 to 2.6 times higher throughput than MEMCACHED.
... Bellosa and Steckermeier [3] use cache-miss counters to do better thread scheduling. CoreTime dynamically decides to migrate an operation to a different core, which is related to computation migration in distributed shared-memory systems (e.g., [4, 10]) and object-based parallel programming language systems for NUMA systems (e.g., [2, 5, 9]). These systems use simple heuristics that do not take advantage of hardware counters, and do not consider the technique as part of a general scheduling problem. ...
Conference Paper
High performance on multicore processors requires that schedulers be reinvented. Traditional schedulers focus on keeping execution units busy by assigning each core a thread to run. Schedulers ought to focus, however, on high utilization of on-chip memory, rather than of execution cores, to reduce the impact of expensive DRAM and remote cache accesses. A challenge in achieving good use of on-chip memory is that the memory is split up among the cores in the form of many small caches. This paper argues for a form of scheduling that assigns each object and its operations to a specific core, moving a thread among the cores as it uses different objects.
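A minimal sketch of this object-affinity idea, assuming Linux and using sched_setaffinity as a crude stand-in for a real scheduler (a real implementation would not re-pin the thread on every call): move the calling thread to the object's home core before running the operation, so the object's cache lines stay resident in that core's cache.

```c
/* Sketch of object-affinity scheduling: pin each object to a core and
 * run the object's operations there, migrating the thread rather than
 * the object's cache lines.  Names and mechanism are illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

typedef struct {
    int home_core;   /* core whose cache holds this object */
    long state;
} object_t;

/* Move the calling thread onto the object's home core, then run the
 * operation while the object's lines stay in that cache. */
void invoke_on_home(object_t *o, void (*op)(object_t *)) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(o->home_core, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* migrate the thread */
    op(o);
}

static void bump(object_t *o) { o->state++; }

int main(void) {
    object_t counter = { .home_core = 0, .state = 0 };
    invoke_on_home(&counter, bump);
    printf("state = %ld\n", counter.state);
    return 0;
}
```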
... One is to explicitly differentiate between local and remote pointers (by using a different programming notation for each type). The main drawback of this approach, taken in several existing systems, including Cid [7], Earth [3], and Olden [10][4], is that the original program must be significantly modified (manually or using compiler analysis [6][10]) such that it always uses the correct reference—remote or local—based on the given mapping of the data structure. It also prevents the mapping of the data structure from being decided at runtime, which is a major limitation for dynamic pointer-based data structures. ...
Conference Paper
We present an approach to implementing and using global pointers in a distributed computing environment. The programmer is able to create pointer-based distributed data structures, which can then be used by sequential or parallel programs without having to differentiate between local and global pointers. Any reference to a remote address causes the process to either migrate to the remote host, where it continues its execution, or to perform a remote access operation. The decision is made automatically and fully transparently to the programmer. By using a hardware-supported memory checking mechanism, we avoid any overhead associated with the detection of remote references.
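The hardware-supported detection this abstract mentions can be approximated in user space. The Linux-only C sketch below is an illustration, not the paper's mechanism: it assumes a 4 KiB page, and printf/mprotect inside a signal handler are tolerated purely for demonstration. A "remote" region is mapped PROT_NONE so any dereference faults, and the fault handler stands in for the migrate-or-fetch decision.

```c
/* Sketch of trap-based remote-reference detection, assuming Linux.
 * A "remote" region is mapped PROT_NONE so any dereference faults;
 * the handler stands in for the migrate-or-fetch decision. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096   /* assumed page size */

static char *remote_region;

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    printf("remote reference at %p: migrate or fetch\n", si->si_addr);
    /* Make the page accessible so the faulting access can be retried;
     * a real system would fill it with the remote data first. */
    mprotect(remote_region, PAGE, PROT_READ | PROT_WRITE);
}

int main(void) {
    remote_region = mmap(NULL, PAGE, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    remote_region[0] = 'x';   /* faults; handler "fetches" the page */
    printf("after fault: %c\n", remote_region[0]);
    return 0;
}
```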
... The Linux kernel could use a variant of address ranges internally, and perhaps allow application control via mmap flags. Applications could use kernel-core-like mechanisms to reduce system call invocation costs, to avoid concurrent access to kernel data, or to manage an application's sharing of its own data using techniques like computation migration [6,14]. Finally, it may be possible for Linux to provide share-like control over the sharing of file descriptors, entries in buffer and directory lookup caches, and network stacks. ...
Conference Paper
Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application's performance. This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions. Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25% faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines.
... Upon start-read/write(X), the region X is mapped locally. MCRL [3] extends CRL with thread migration. A start-write(X) now causes migration of the current activation record to the machine that hosts X. ...
Article
A DSM protocol ensures that a thread can access data allocated on another machine using some consistency protocol. The consistency protocol can either replicate the data and unify replica changes periodically, or the thread, upon remote access, can migrate to the machine that hosts the data and access the data there. There is a performance trade-off between these extremes. Data replication suffers from a high memory overhead, as every replicated object or page consumes memory on each machine. On the other hand, it is as bad to migrate threads upon each remote access, since repeated accesses to the same distributed data set will cause repeated network communication, whereas replication incurs this cost only once (at the cost of increased administration overhead to manage the replicas). We propose a hybrid protocol that uses selective replication with thread migration as its default. Even in the presence of extreme memory pressure and thread migrations, our protocol reaches or exceeds the performance that can be achieved by means of manual replication and explicit changes to the application's code.
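The default-migrate, selectively-replicate policy might look like the C sketch below. The per-object counter and the threshold are my illustrative stand-ins for whatever the paper actually uses to detect repeated remote access.

```c
/* Minimal sketch of a hybrid DSM policy: migrate the thread by
 * default, switch to replication once an object has been reached
 * remotely often enough to amortize a replica's memory cost. */
#include <stdio.h>

#define REPLICATE_AFTER 3   /* illustrative threshold */

typedef struct {
    int home_node;
    int remote_hits;    /* remote accesses seen so far */
    int replicated;     /* does a local replica exist? */
} dsm_object_t;

static void migrate_thread_to(int node)   { printf("migrate thread to node %d\n", node); }
static void make_replica(dsm_object_t *o) { o->replicated = 1; printf("replicate object locally\n"); }

void access_remote(dsm_object_t *o) {
    if (o->replicated) return;              /* served from local replica */
    if (++o->remote_hits >= REPLICATE_AFTER)
        make_replica(o);                    /* hot object: replicate */
    else
        migrate_thread_to(o->home_node);    /* default: go to the data */
}

int main(void) {
    dsm_object_t o = { .home_node = 1 };
    for (int i = 0; i < 5; i++)
        access_remote(&o);
    return 0;
}
```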
... Dynamic computation migration is useful for concurrent data structures with unpredictable read/write patterns [Hsieh et al., 1996]. Simultaneously considering thread memory access types and global sharing could reduce the communication during load balancing by half [Liang et al., 2002]. ...
... A number of algorithms have been proposed in the context of process migration [Milojicić et al., 2000]. A study of computation migration in which the choice is made at runtime shows that dynamic migration is useful for concurrent data structures with unpredictable read/write patterns [Hsieh et al., 1996]. Another study has pointed out that when threads are migrated from heavily loaded nodes to lightly loaded nodes for load balancing in software DSM systems, the communication cost of maintaining data consistency increases if the migrated threads are carelessly selected [Liang et al., 2002]. ...
Thesis
Distributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to their programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments, resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model designs in two respects: the limitation of the simulation framework on model extensibility, and the lack of verification applicability during a simulation run, which delays the verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility of maximising the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similar cache-miss results, with an average difference of ±6% and a worst case of less than 15%. These results confirm that the techniques used in developing DSiMCLUSTER can contribute to achieving both (a) a highly extensible simulation framework that keeps up with the ongoing innovation of DSM architectures, and (b) verification applicability, resulting in an efficient framework for memory analysis experiments on DSM architectures.
... Whereas D-CVM uses a correlation map to decide which threads to migrate to the same node of the cluster, MILLIPEDE follows a history-based approach by maintaining a history of pages accessed by local threads. MCRL [10] performs dynamic computation migration through migrating only parts of the thread's current activation records to remote nodes. Mosix [11] enhances the OS kernel to support process migration in a cluster environment. ...
Article
We present a compiler-based technique to automatically identify and extract Remote Procedure Calls, so-called Function Splices, out of potentially arbitrary sequences of Java code compiled for a software DSM. The goal is to lower communication latencies and message traffic by replacing data shipping with function shipping. Dynamic Function Splicing dynamically decides at runtime whether to invoke a function splice on the local machine or to execute it remotely on the home node of the requested data. On proof-of-concept micro-benchmarks, Dynamic Function Splicing reduces the execution wall time by approximately 29%; about 25% of the messages can be saved.
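To make that runtime decision concrete, here is a hypothetical version of the wrapper such a compiler could emit around an extracted splice. The size-based heuristic and all names are assumptions for illustration, not the paper's mechanism, and the sketch is in C rather than the paper's Java setting.

```c
/* Sketch of an emitted splice wrapper: at the call site, decide at
 * runtime whether to run the extracted splice locally or ship it to
 * the home node of its data. */
#include <stdio.h>

static int local_node = 0;

/* The extracted splice: operates on one remote object. */
static long sum_list(const long *data, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += data[i];
    return s;
}

/* Stand-in for the DSM runtime's function-shipping path. */
static long ship_to_home(int node, const long *data, int n) {
    printf("execute splice on home node %d\n", node);
    return sum_list(data, n);   /* would run remotely in the real system */
}

/* Emitted wrapper: ship the splice when the data lives elsewhere and
 * is large enough that fetching it would cost more than one RPC. */
long sum_list_spliced(int home_node, const long *data, int n) {
    if (home_node != local_node && n > 16)
        return ship_to_home(home_node, data, n);
    return sum_list(data, n);   /* local execution: data is cheap to fetch */
}

int main(void) {
    long v[32] = {0};
    v[0] = 7;
    printf("sum = %ld\n", sum_list_spliced(1, v, 32));
    return 0;
}
```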
... The information supporting thread selection consists of remote-reference histories collected by a statistical component in the underlying DSM system. MCRL [5] is a multithreaded distributed shared-memory system that implements a dynamic choice between data and computation migration. MCRL data objects (programmer-defined regions) are managed using a home-based sequentially consistent protocol. ...
Article
Shared memory programs running on Non-Uniform Memory Access (NUMA) machines usually face inherent performance problems stemming from excessive remote memory accesses. A solution, called the Adaptive Runtime System (ARS), is presented in this paper. ARS is designed to adjust the data distribution at runtime through automatic page migrations. It uses memory access histograms gathered by hardware monitors to find access hot spots and, based on this detection, to dynamically and transparently modify the data layout. In this way, incorrectly allocated data can be moved to the most appropriate node and hence data locality can be improved. Simulations show that this achieves performance gains of up to 40%.
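Histogram-driven placement of this kind reduces to a simple rule, sketched below in C with illustrative structures and an illustrative threshold (ARS reads the per-node counts from hardware monitors): move a page to the node that accesses it most, but only when that node clearly dominates, to avoid ping-ponging pages between nodes.

```c
/* Sketch of histogram-driven page placement: given per-node access
 * counts for a page, move it to the node that touches it most. */
#include <stdio.h>

#define NNODES 4

typedef struct {
    int owner;                /* node currently holding the page */
    long accesses[NNODES];    /* histogram from the hardware monitor */
} page_t;

static void migrate_page(page_t *p, int node) {
    printf("migrate page: node %d -> node %d\n", p->owner, node);
    p->owner = node;
}

/* Migrate only when some other node clearly dominates (here, by a
 * 2x margin), so pages do not bounce between near-equal users. */
void rebalance(page_t *p) {
    int hottest = 0;
    for (int n = 1; n < NNODES; n++)
        if (p->accesses[n] > p->accesses[hottest]) hottest = n;
    if (hottest != p->owner &&
        p->accesses[hottest] > 2 * p->accesses[p->owner])
        migrate_page(p, hottest);
}

int main(void) {
    page_t p = { .owner = 0, .accesses = {10, 250, 3, 8} };
    rebalance(&p);   /* node 1 dominates: page moves there */
    return 0;
}
```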
... Regions are locally cached until another machine requires the same object, with some lazy flushing performed at each end-read/write. MCRL [11] is an object-based system derived from CRL that implements computation migration. Write operations are shipped to the region's creating machine, while read operations are performed locally. ...
Article
Jackal is a fine-grained distributed shared memory implementation of the Java programming language. Jackal implements Java's memory model and allows multithreaded Java programs to run unmodified on distributed-memory systems.