Table 6 - uploaded by Cezary Dubnicki
Table 6.1: M(50, 10) Machine - MCPR Relative to Vblock(4, 64, 16, (1,1))

Source publication
Article
Full-text available
Thesis (Ph. D.)--University of Rochester. Dept. of Computer Science, 1993. Simultaneously published in the Technical Report series. Several studies have shown that the performance of coherent caches depends on the relationship between the cache block size and the granularity of sharing and locality exhibited by the program. Large cache blocks explo...

Context in source publication

Context 1
... believe this choice represents a reasonable machine of today. The performance of adjustable caches for the remaining machines is investigated in chapter 7. Table 6.1 gives the mean cost per reference relative to the adjustable cache Vblock(4, 64, 16) for five fixed-line-size caches (with line sizes between 4 and 64 words) without prefetching, and two prefetching caches, Prefetch(8,8) and Prefetch(16,4). ...
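A plausible reading of the "relative" figures in this table and excerpt, assuming MCPR denotes the mean cost per reference and that the normalization is a simple ratio (the precise definition is given in the thesis itself):

```latex
% Relative MCPR of a cache configuration C with respect to the adjustable
% cache Vblock(4, 64, 16): values above 1 mean C incurs a higher average
% cost per memory reference than the adjustable cache.
\[
  \mathrm{MCPR}_{\mathrm{rel}}(C) \;=\;
  \frac{\mathrm{MCPR}(C)}{\mathrm{MCPR}\bigl(\mathrm{Vblock}(4,\,64,\,16)\bigr)}
\]
```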

Similar publications

Data
Full-text available
Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that often the number of sub-blocks within a line that are actually used is low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e.,...
Preprint
Full-text available
Streaming graph processing involves performing updates and analytics on a time-evolving graph. The underlying representation format largely determines the throughputs of these updates and analytics phases. Existing formats usually employ variations of hash tables or adjacency lists. However, adjacency-list-based approaches perform poorly on heavy-t...
Article
Full-text available
In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a f...
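A minimal sketch of the future-based fork-join pattern this abstract describes (the splitting threshold and the summation task are illustrative, not taken from the paper):

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Fork the left half of the range as an asynchronous task (a future),
// compute the right half on the current thread, then join on the future.
long long sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < (1u << 16))  // small ranges: no fork
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, sum, std::cref(v), lo, mid);
    long long right = sum(v, mid, hi);
    return left.get() + right;  // join: wait for the forked task's result
}

int main() {
    std::vector<int> data(1 << 20, 1);
    std::cout << sum(data, 0, data.size()) << "\n";  // prints 1048576
}
```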
Conference Paper
Full-text available
Cache attacks are a special form of implementation attacks and focus on the exploitation of weaknesses in the implementation of a specific algorithm. We demonstrate an access-driven cache attack, which is based on the analysis of memory-access patterns due to the T-table accesses of the Advanced Encryption Standard (AES). Based on the work of Trome...
Chapter
Full-text available
CAMELLIA is a 128 bit block cipher certified for its security by NESSIE and CRYPTREC. Yet an implementation of CAMELLIA can easily fall prey to cache attacks. In this paper we present an attack on CAMELLIA, which utilizes cache access patterns along with the differential properties of CAMELLIA's s-boxes. The attack, when implemented on a PowerPC mi...

Citations

... Memory references from different processors can access shared data variables as well as the synchronization variables. This means that a developer cannot make any assumptions about the ordering of events between synchronization points when using the weak-ordering model [7], [8]. To prevent nondeterministic behavior, each processor must therefore guarantee that its outstanding shared-memory accesses have completed before a synchronization operation can be issued. ...
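A minimal sketch of the constraint this excerpt describes: under weak ordering, a processor's outstanding shared-memory accesses must be made visible before it issues a synchronization operation. The variable names and the release/acquire flag are illustrative, not from the citing paper:

```cpp
#include <atomic>
#include <thread>

int shared_data = 0;                 // ordinary shared variable
std::atomic<bool> sync_flag{false};  // synchronization variable

void producer() {
    shared_data = 42;  // outstanding shared access ...
    // ... must be completed (made visible) before the synchronization
    // operation is issued; release ordering on the flag enforces this.
    sync_flag.store(true, std::memory_order_release);
}

void consumer() {
    // Between synchronization points no ordering can be assumed, so the
    // consumer waits on the synchronization variable with acquire ordering.
    while (!sync_flag.load(std::memory_order_acquire)) { /* spin */ }
    int observed = shared_data;  // guaranteed to observe 42
    (void)observed;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```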
... • Small cache blocks can reduce the coherence forces but lead to longer execution time. • Large cache blocks might cause unnecessary coherence forces due to false sharing [8]. We used the following parameters with different block sizes to verify these observations: snoopy-based protocol. ...
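The false-sharing effect mentioned in the second bullet can be illustrated with a small sketch (the 64-byte block size and the counter workload are assumptions, not from the citing paper): two logically independent counters placed in the same cache block force coherence traffic between the writing threads, while aligning each counter to its own block avoids it.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kBlockBytes = 64;  // assumed cache block size

// Both counters share one cache block: each write invalidates the other
// core's copy even though the threads never share data (false sharing).
struct SharedBlock {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own block: no coherence traffic between
// the two threads.
struct PaddedBlock {
    alignas(kBlockBytes) std::atomic<long> a{0};
    alignas(kBlockBytes) std::atomic<long> b{0};
};

template <class T>
void hammer(T& block) {
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) block.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) block.b++; });
    t1.join();
    t2.join();
}

int main() {
    SharedBlock s;  // typically slower: false sharing within one block
    PaddedBlock p;  // typically faster: independent blocks
    hammer(s);
    hammer(p);
}
```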
Article
Many modern computing architectures that utilize dedicated caches rely on coherency mechanisms to maintain consistency across those caches [2]. These mechanisms, which are the focus of this paper, rely on underlying hardware synchronicity to resolve the value of a particular piece of data at a given instant, given the discrete way in which processor instructions are executed each clock cycle with the corresponding memory accesses following [2], [4]. Inconsistencies occur when data is written and are noticed when data is read. The goal of this paper is to explore the idiosyncrasies of the coherence mechanisms used with dedicated caches by studying two common types of mechanisms, snoop-based and directory-based, and simulating their operation on a simulated architecture consisting of multiple processing cores and a layered cache system with dedicated caches. We implemented snoopy and directory protocols and measured hit rate, compulsory miss rate, capacity miss rate, and coherence forces for each. In addition, we show how each scheme is affected by block size and cache size.
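As a rough sketch of the bookkeeping such a simulation needs (a generic MSI-style directory entry under assumed names, not the authors' code), the "coherence forces" counted in the abstract can be modeled as the invalidations a write sends to other sharers:

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxCores = 16;

// Line state kept by a private cache in a simple MSI-style protocol.
enum class LineState : std::uint8_t { Invalid, Shared, Modified };

// Directory entry for one memory block: which cores hold a copy and
// whether one of them holds it exclusively (dirty).
struct DirectoryEntry {
    std::bitset<kMaxCores> sharers;  // cores with a copy of the block
    bool dirty = false;              // true if one core holds it Modified
};

// On a write by `core`, every other sharer must be invalidated; each such
// invalidation is one coherence force in the abstract's terminology.
inline int coherence_forces_on_write(DirectoryEntry& e, std::size_t core) {
    int forces = static_cast<int>(e.sharers.count()) -
                 (e.sharers.test(core) ? 1 : 0);
    e.sharers.reset();
    e.sharers.set(core);
    e.dirty = true;
    return forces;
}

int main() {
    DirectoryEntry e;
    e.sharers.set(0);
    e.sharers.set(1);
    return coherence_forces_on_write(e, 2);  // invalidates cores 0 and 1 -> 2
}
```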
... Dubnicki [Dubnicki, 1993] ...
Article
An important architectural design decision affecting the performance of coherent caches is the choice of block size. There are two primary factors that influence this choice: the reference behavior of applications and the remote access bandwidth and latency of the machine. Given that we anticipate increases in both network bandwidth and latency (in processor cycles) in scalable shared-memory multiprocessors, the question arises as to what effect these increases will have on the choice of block size. We use analytical modeling and execution-driven simulation of parallel programs on a large-scale shared-memory machine to examine the relationship between cache block size and application performance as a function of remote access bandwidth and latency. We show that even under assumptions of high remote access bandwidth and latency, the best application performance usually results from using cache blocks between 32 and 128 bytes in size. We also show that modifying the program to remove the dominant source of misses may not increase the best performing block size. We conclude that large cache blocks cannot be justified in most realistic scenarios.
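The trade-off this abstract analyzes can be made concrete with a simple illustrative miss-cost model (the symbols below are assumptions for exposition, not the paper's actual model): the cost of servicing a remote miss grows linearly with the block size, while the miss rate typically falls with block size until false sharing dominates.

```latex
% B    = cache block size (bytes)
% L    = remote access latency (cycles)
% W    = remote bandwidth (bytes/cycle)
% m(B) = miss rate as a function of block size: decreasing with B while
%        spatial locality is exploited, increasing once false sharing dominates.
\[
  \mathrm{cost}(B) \;=\; m(B)\,\Bigl(L + \frac{B}{W}\Bigr)
\]
% The best block size minimizes cost(B): a larger latency L favors larger
% blocks, while a smaller bandwidth W (costlier bytes) favors smaller blocks.
```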
Article
This thesis has two main goals: the study and the implementation of an emulator of parallel computers with shared virtual memory. These two goals divide the thesis into two distinct parts. The first part is a study of the techniques that can be used to construct memory hierarchies or to maintain data consistency in such memories. The second part describes the emulator. The objective of any emulator is to be as convenient as possible. For this reason our emulator must include enough parameters to emulate the widest possible range of parallel machines, while having a response delay not tremendously different from that of a real execution. To meet this last requirement, our emulator actually executes all instructions except the page exchanges across the interconnection network, which are simulated. The parameters of the emulator are: the number of processors, the network's characteristics (e.g., latency, bit rate), and the consistency methods, to describe the target machine; and the data size and distribution, to describe the applications. Our emulator runs on top of a MACH micro-kernel and a UNIX server. It uses several functions of the MACH micro-kernel, especially the external pager.
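The parameter set listed in this abstract could be captured in a configuration record along these lines (the field names and defaults are purely illustrative; the emulator's real interface is not described in the abstract):

```cpp
#include <cstddef>
#include <string>

// Illustrative configuration for an emulator of a shared-virtual-memory
// parallel machine, mirroring the parameters listed in the abstract.
struct EmulatorConfig {
    // Target machine description
    int         num_processors     = 16;
    double      network_latency_us = 5.0;          // network latency
    double      network_bit_rate   = 1.0e9;        // bits per second
    std::string consistency_method = "sequential"; // data consistency method

    // Application description
    std::size_t data_size_bytes   = 64u * 1024 * 1024;
    std::string data_distribution = "block";       // how data is spread over nodes

    // Only page exchanges over the network are simulated; everything else
    // executes natively (via the MACH external pager, per the abstract).
    std::size_t page_size_bytes = 4096;
};

int main() {
    EmulatorConfig cfg;       // default target machine and workload
    cfg.num_processors = 32;  // vary the emulated machine
    return cfg.num_processors > 0 ? 0 : 1;
}
```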
Article
Large-scale, shared-memory multiprocessors have non-uniform memory access (NUMA) costs, and high communication cost dominates the execution time of matrix computations. Memory contention and remote memory access are the two major communication overheads on large-scale NUMA multiprocessors. However, previous experiments and discussions focus either on reducing the number of remote memory accesses or on alleviating memory contention overhead. In this paper, we propose a simple but effective processor allocation policy, called rectangular processor allocation, to alleviate both overheads at the same time. The policy divides the matrix elements into a certain number of rectangular blocks and assigns each processor to compute the results of one rectangular block. This methodology can avoid many unnecessary accesses to the memory modules. After running many matrix computations under a realistic memory system simulator, we confirmed that at least one-fourth of the communication overhead can be eliminated. Therefore, we conclude that the rectangular processor allocation policy performs better than other popular policies, and that combining it with a software-interleaved data allocation policy is a better choice for alleviating communication overhead.
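A sketch of the rectangular allocation idea, assuming a Pr x Pc processor grid over an N x N matrix (the index arithmetic below is an illustration of the policy as described, not code from the paper):

```cpp
#include <cstdio>

// Rectangular processor allocation: the N x N result matrix is divided into
// Pr x Pc rectangular blocks, and processor (pr, pc) computes exactly one
// block, so it touches only the memory modules backing that block's rows
// and columns.
struct Block { int row_lo, row_hi, col_lo, col_hi; };  // half-open ranges

Block block_for(int N, int Pr, int Pc, int pr, int pc) {
    int rows = (N + Pr - 1) / Pr;  // ceiling division
    int cols = (N + Pc - 1) / Pc;
    Block b;
    b.row_lo = pr * rows;
    b.row_hi = (pr + 1) * rows < N ? (pr + 1) * rows : N;
    b.col_lo = pc * cols;
    b.col_hi = (pc + 1) * cols < N ? (pc + 1) * cols : N;
    return b;
}

int main() {
    // An 8 x 8 matrix on a 2 x 2 processor grid: each processor owns a 4 x 4 block.
    for (int pr = 0; pr < 2; ++pr)
        for (int pc = 0; pc < 2; ++pc) {
            Block b = block_for(8, 2, 2, pr, pc);
            std::printf("P(%d,%d): rows [%d,%d) cols [%d,%d)\n",
                        pr, pc, b.row_lo, b.row_hi, b.col_lo, b.col_hi);
        }
}
```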