Table 6 - uploaded by Cezary Dubnicki
Table 6.1: M(50, 10) Machine - MCPR Relative to Vblock(4, 64, 16, (1,1))

Source publication
Article
Full-text available
Thesis (Ph. D.)--University of Rochester. Dept. of Computer Science, 1993. Simultaneously published in the Technical Report series. Several studies have shown that the performance of coherent caches depends on the relationship between the cache block size and the granularity of sharing and locality exhibited by the program. Large cache blocks explo...

Context in source publication

Context 1
... believe this choice represents a reasonable machine of today. The performance of adjustable caches for the remaining machines is investigated in chapter 7. Table 6.1 gives the mean cost per reference relative to the adjustable cache Vblock(4, 64, 16) for five fixed-line-size caches (with line sizes between 4 and 64 words) without prefetching, and two prefetching caches, Prefetch(8,8) and Prefetch(16,4). ...
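A plausible reading of the "relative" figures in this table and excerpt, assuming MCPR denotes the mean cost per reference and that the normalization is a simple ratio (the precise definition is given in the thesis itself):

```latex
% Relative MCPR of a cache configuration C with respect to the adjustable
% cache Vblock(4, 64, 16): values above 1 mean C incurs a higher average
% cost per memory reference than the adjustable cache.
\[
  \mathrm{MCPR}_{\mathrm{rel}}(C) \;=\;
  \frac{\mathrm{MCPR}(C)}{\mathrm{MCPR}\bigl(\mathrm{Vblock}(4,\,64,\,16)\bigr)}
\]
```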

Similar publications

Data
Full-text available
Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that often the number of sub-blocks within a line that are actually used is low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e.,...
Preprint
Full-text available
Streaming graph processing involves performing updates and analytics on a time-evolving graph. The underlying representation format largely determines the throughputs of these updates and analytics phases. Existing formats usually employ variations of hash tables or adjacency lists. However, adjacency-list-based approaches perform poorly on heavy-t...
Article
Full-text available
In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a f...
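A minimal sketch of the future-based fork-join pattern this abstract describes (the splitting threshold and the summation task are illustrative, not taken from the paper):

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Fork the left half of the range as an asynchronous task (a future),
// compute the right half on the current thread, then join on the future.
long long sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < (1u << 16))  // small ranges: no fork
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, sum, std::cref(v), lo, mid);
    long long right = sum(v, mid, hi);
    return left.get() + right;  // join: wait for the forked task's result
}

int main() {
    std::vector<int> data(1 << 20, 1);
    std::cout << sum(data, 0, data.size()) << "\n";  // prints 1048576
}
```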
Conference Paper
Full-text available
Cache attacks are a special form of implementation attacks and focus on the exploitation of weaknesses in the implementation of a specific algorithm. We demonstrate an access-driven cache attack, which is based on the analysis of memory-access patterns due to the T-table accesses of the Advanced Encryption Standard (AES). Based on the work of Trome...
Chapter
Full-text available
CAMELLIA is a 128 bit block cipher certified for its security by NESSIE and CRYPTREC. Yet an implementation of CAMELLIA can easily fall prey to cache attacks. In this paper we present an attack on CAMELLIA, which utilizes cache access patterns along with the differential properties of CAMELLIA's s-boxes. The attack, when implemented on a PowerPC mi...

Citations

... Memory references from different processors can access shared data variables as well as the synchronization variables. This means that a developer cannot make any assumptions about the ordering of events between synchronization points when using the weak-ordering model [7], [8]. To prevent nondeterministic behavior, each processor must therefore guarantee that its outstanding shared-memory accesses have completed before a synchronization operation can be issued. ...
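A minimal sketch of the constraint this excerpt describes: under weak ordering, a processor's outstanding shared-memory accesses must be made visible before it issues a synchronization operation. The variable names and the release/acquire flag are illustrative, not from the citing paper:

```cpp
#include <atomic>
#include <thread>

int shared_data = 0;                 // ordinary shared variable
std::atomic<bool> sync_flag{false};  // synchronization variable

void producer() {
    shared_data = 42;  // outstanding shared access ...
    // ... must be completed (made visible) before the synchronization
    // operation is issued; release ordering on the flag enforces this.
    sync_flag.store(true, std::memory_order_release);
}

void consumer() {
    // Between synchronization points no ordering can be assumed, so the
    // consumer waits on the synchronization variable with acquire ordering.
    while (!sync_flag.load(std::memory_order_acquire)) { /* spin */ }
    int observed = shared_data;  // guaranteed to observe 42
    (void)observed;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```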
... • Small cache blocks can reduce the coherence forces but lead to longer execution time. • Large cache blocks might cause unnecessary coherence forces due to false sharing [8]. We used the following parameters with different block sizes to verify these observations: snoopy-based protocol. ...
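The false-sharing effect mentioned in the second bullet can be illustrated with a small sketch (the 64-byte block size and the counter workload are assumptions, not from the citing paper): two logically independent counters placed in the same cache block force coherence traffic between the writing threads, while aligning each counter to its own block avoids it.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kBlockBytes = 64;  // assumed cache block size

// Both counters share one cache block: each write invalidates the other
// core's copy even though the threads never share data (false sharing).
struct SharedBlock {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own block: no coherence traffic between
// the two threads.
struct PaddedBlock {
    alignas(kBlockBytes) std::atomic<long> a{0};
    alignas(kBlockBytes) std::atomic<long> b{0};
};

template <class T>
void hammer(T& block) {
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) block.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) block.b++; });
    t1.join();
    t2.join();
}

int main() {
    SharedBlock s;  // typically slower: false sharing within one block
    PaddedBlock p;  // typically faster: independent blocks
    hammer(s);
    hammer(p);
}
```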
Article
Many modern computing architectures that utilize dedicated caches rely on coherency mechanisms to maintain consistency across those caches [2]. These mechanisms, which are the focus of this paper, rely on underlying hardware synchronicity to resolve the value of a particular piece of data at a given instant, given the discrete way in which processor instructions are executed each clock cycle with the corresponding memory accesses following [2], [4]. Inconsistencies occur when data is written and are noticed when data is read. The goal of this paper is to explore the idiosyncrasies of the coherence mechanisms used with dedicated caches by studying two common types of mechanisms, snoop-based and directory-based, and simulating their operation on a simulated architecture consisting of multiple processing cores and a layered cache system with dedicated caches. We implemented snoopy and directory protocols and measured hit rate, compulsory miss rate, capacity miss rate, and coherence forces for each. In addition, we show how each scheme is affected by block size and cache size.
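As a rough sketch of the bookkeeping such a simulation needs (a generic MSI-style directory entry under assumed names, not the authors' code), the "coherence forces" counted in the abstract can be modeled as the invalidations a write sends to other sharers:

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxCores = 16;

// Line state kept by a private cache in a simple MSI-style protocol.
enum class LineState : std::uint8_t { Invalid, Shared, Modified };

// Directory entry for one memory block: which cores hold a copy and
// whether one of them holds it exclusively (dirty).
struct DirectoryEntry {
    std::bitset<kMaxCores> sharers;  // cores with a copy of the block
    bool dirty = false;              // true if one core holds it Modified
};

// On a write by `core`, every other sharer must be invalidated; each such
// invalidation is one coherence force in the abstract's terminology.
inline int coherence_forces_on_write(DirectoryEntry& e, std::size_t core) {
    int forces = static_cast<int>(e.sharers.count()) -
                 (e.sharers.test(core) ? 1 : 0);
    e.sharers.reset();
    e.sharers.set(core);
    e.dirty = true;
    return forces;
}

int main() {
    DirectoryEntry e;
    e.sharers.set(0);
    e.sharers.set(1);
    return coherence_forces_on_write(e, 2);  // invalidates cores 0 and 1 -> 2
}
```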
... Dubnicki [Dubnicki, 1993] ...
Article
An important architectural design decision affecting the performance of coherent caches is the choice of block size. There are two primary factors that influence this choice: the reference behavior of applications and the remote access bandwidth and latency of the machine. Given that we anticipate increases in both network bandwidth and latency (in processor cycles) in scalable shared-memory multiprocessors, the question arises as to what effect these increases will have on the choice of block size. We use analytical modeling and execution-driven simulation of parallel programs on a large-scale shared-memory machine to examine the relationship between cache block size and application performance as a function of remote access bandwidth and latency. We show that even under assumptions of high remote access bandwidth and latency, the best application performance usually results from using cache blocks between 32 and 128 bytes in size. We also show that modifying the program to remove the dominant source of misses may not increase the best performing block size. We conclude that large cache blocks cannot be justified in most realistic scenarios.
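The trade-off this abstract analyzes can be made concrete with a simple illustrative miss-cost model (the symbols below are assumptions for exposition, not the paper's actual model): the cost of servicing a remote miss grows linearly with the block size, while the miss rate typically falls with block size until false sharing dominates.

```latex
% B    = cache block size (bytes)
% L    = remote access latency (cycles)
% W    = remote bandwidth (bytes/cycle)
% m(B) = miss rate as a function of block size: decreasing with B while
%        spatial locality is exploited, increasing once false sharing dominates.
\[
  \mathrm{cost}(B) \;=\; m(B)\,\Bigl(L + \frac{B}{W}\Bigr)
\]
% The best block size minimizes cost(B): a larger latency L favors larger
% blocks, while a smaller bandwidth W (costlier bytes) favors smaller blocks.
```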
Article
This thesis has two main goals: the study and the implementation of an emulator of parallel computers with shared virtual memory. These two goals divide the thesis into two distinct parts. The first part is a study of the techniques that can be used to construct memory hierarchies or to maintain data consistency in such memories. The second part describes the emulator. The objective of any emulator is to be as convenient as possible. For this reason our emulator must include enough parameters to emulate the widest possible range of parallel machines, while having a response delay not tremendously different from that of a real execution. To meet this last requirement, our emulator actually executes all instructions except the page exchanges across the interconnection network, which are simulated. The parameters of the emulator are: the number of processors, the network's characteristics (e.g., latency, bit rate), and the consistency methods, to describe the target machine; and the data size and distribution, to describe the applications. Our emulator runs on top of a MACH micro-kernel and a UNIX server. It uses several functions of the MACH micro-kernel, especially the external pager.
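The parameter set listed in this abstract could be captured in a configuration record along these lines (the field names and defaults are purely illustrative; the emulator's real interface is not described in the abstract):

```cpp
#include <cstddef>
#include <string>

// Illustrative configuration for an emulator of a shared-virtual-memory
// parallel machine, mirroring the parameters listed in the abstract.
struct EmulatorConfig {
    // Target machine description
    int         num_processors     = 16;
    double      network_latency_us = 5.0;          // network latency
    double      network_bit_rate   = 1.0e9;        // bits per second
    std::string consistency_method = "sequential"; // data consistency method

    // Application description
    std::size_t data_size_bytes   = 64u * 1024 * 1024;
    std::string data_distribution = "block";       // how data is spread over nodes

    // Only page exchanges over the network are simulated; everything else
    // executes natively (via the MACH external pager, per the abstract).
    std::size_t page_size_bytes = 4096;
};

int main() {
    EmulatorConfig cfg;       // default target machine and workload
    cfg.num_processors = 32;  // vary the emulated machine
    return cfg.num_processors > 0 ? 0 : 1;
}
```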
Article
Large-scale, shared-memory multiprocessors have non-uniform memory access (NUMA) costs, and high communication cost dominates the execution time of matrix computations. Memory contention and remote memory access are the two major communication overheads on large-scale NUMA multiprocessors. However, previous experiments and discussions focus either on reducing the number of remote memory accesses or on alleviating memory contention overhead. In this paper, we propose a simple but effective processor allocation policy, called rectangular processor allocation, to alleviate both overheads at the same time. The policy divides the matrix elements into a certain number of rectangular blocks and assigns each processor to compute the results of one rectangular block. This methodology can avoid many unnecessary accesses to the memory modules. After running many matrix computations under a realistic memory system simulator, we confirmed that at least one-fourth of the communication overhead can be eliminated. Therefore, we conclude that the rectangular processor allocation policy performs better than other popular policies, and that combining it with a software-interleaved data allocation policy is a better choice for alleviating communication overhead.
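A sketch of the rectangular allocation idea, assuming a Pr x Pc processor grid over an N x N matrix (the index arithmetic below is an illustration of the policy as described, not code from the paper):

```cpp
#include <cstdio>

// Rectangular processor allocation: the N x N result matrix is divided into
// Pr x Pc rectangular blocks, and processor (pr, pc) computes exactly one
// block, so it touches only the memory modules backing that block's rows
// and columns.
struct Block { int row_lo, row_hi, col_lo, col_hi; };  // half-open ranges

Block block_for(int N, int Pr, int Pc, int pr, int pc) {
    int rows = (N + Pr - 1) / Pr;  // ceiling division
    int cols = (N + Pc - 1) / Pc;
    Block b;
    b.row_lo = pr * rows;
    b.row_hi = (pr + 1) * rows < N ? (pr + 1) * rows : N;
    b.col_lo = pc * cols;
    b.col_hi = (pc + 1) * cols < N ? (pc + 1) * cols : N;
    return b;
}

int main() {
    // An 8 x 8 matrix on a 2 x 2 processor grid: each processor owns a 4 x 4 block.
    for (int pr = 0; pr < 2; ++pr)
        for (int pc = 0; pc < 2; ++pc) {
            Block b = block_for(8, 2, 2, pr, pc);
            std::printf("P(%d,%d): rows [%d,%d) cols [%d,%d)\n",
                        pr, pc, b.row_lo, b.row_hi, b.col_lo, b.col_hi);
        }
}
```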