MEMORY WALL – A CRITICAL FACTOR IN CURRENT HIGH-PERFORMANCE MICROPROCESSORS

January 2006

Authors:

Adrian Florea

Lucian Blaga University of Sibiu

Arpad Gellert

Lucian Blaga University of Sibiu

“Lucian Blaga” University of Sibiu, Computer Science Department,

Emil Cioran Street, No. 4, 550025 Sibiu, Romania, {adrian.florea, arpad.gellert}@ulbsibiu.ro

Abstract

Memory access time is a critical factor limiting the performance of current microprocessors. Critical high-latency instructions caused by cache misses can slow a processor well below its peak potential. Pointer chasing in applications reduces overall processor performance because it serializes the execution of long-latency memory operations and inhibits their overlap. In this paper we present statistics on the SPEC2000 benchmarks related to critical load instructions (loads that miss in the L2 data cache and have reached the head of the ROB). We found that 29.13% of critical loads occur at a distance of at most 2 instructions from the previous critical load. We also determined how many critical load instructions do not overlap their access to main memory. On average, 10.91% of critical loads are isolated or at the front of a group of loads.

Different approaches in overcoming the memory wall

In the last decade processor cycle time has been decreasing faster than memory access time. Additionally, the architectural design of processors has improved with the development of deeper pipelines and multiple-instruction-issue architectures. These factors have widened the technological gap, named the memory wall [3], between processor speed and the speed of the underlying memory hierarchy. Thus, in order to improve the performance of memory-intensive application programs, innovative techniques are needed to tolerate long-latency main memory accesses.

Recently, several approaches have been proposed for designing microarchitectures able to overcome the memory wall by introducing techniques for efficient management of architectural resources (the Reorder Buffer (ROB), register files, instruction queues and load/store queues). Akkary et al. [1] introduced the Checkpoint Processing and Recovery microarchitecture, which is ROB-free and requires only a small number of rename map checkpoints, selectively created at low-confidence branches, while supporting a large instruction window on the order of thousands of instructions. In [5] the authors propose runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long-latency operations, allowing the processor to execute far ahead in the program path. In [4] the authors propose Checkpointed Early Resource Recycling (Cherry), a hybrid mode of execution based on ROB and checkpointing that decouples resource recycling and instruction retirement. This approach uses a single checkpoint outside the ROB to divide the ROB into two regions: a speculative region and a non-speculative region. Cherry is then able to release physical registers and LSQ entries early for instructions in the non-speculative ROB section.

Cristal et al. [3] proposed Kilo-Instruction Processors (KIP) as a solution to overcome the increasing gap between processor performance and memory speed. KIP is a new type of out-of-order superscalar processor that overlaps long memory access delays by maintaining thousands of in-flight instructions. The traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints. To overcome this resource-scalability problem, the authors proposed a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order and makes a reorder buffer unnecessary. KIP is based on a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries.

However, the performance KIP achieves on integer applications is sometimes limited by pointer chasing and hard-to-predict branches. Pointer chasing is even more harmful than hard-to-predict branches, because it serializes the execution of long-latency memory operations and inhibits their overlap. To solve this problem, value prediction techniques might be useful for predicting the addresses along a pointer chain, thereby allowing these accesses to overlap. For memory-intensive workloads with heavy pointer chasing, sequential cache misses resulting from pointer-chasing code structures dominate the overall execution time. These cache misses form a memory dependence chain, since the address of a missing load depends on the value of the previous missing load. Taking as an example the frequently executed code segment from the refresh_potential function (see below) of the mcf benchmark, the profile information shows that the pointer-chasing expressions ‘node->child’, ‘node->basic_arc->cost’ and ‘node->pred->potential’ result in many cache misses [10].

    while( node )
    {
        if( node->orientation == UP )
            node->potential = node->basic_arc->cost + node->pred->potential; // (Nodes 1,2,3,4)
        else /* == DOWN */
        {
            node->potential = node->pred->potential - node->basic_arc->cost;
            checksum++;
        }
        tmp = node;
        node = node->child; // (Nodes 0, 5)
    }

The memory dependence chains formed by these missing loads are shown below (see Figure 1).

Figure 1. The memory dependence chain based on the above code. (a) The dependence chain for a single iteration. (b) The dependence chain for multiple iterations.

In Figure 1(a), the dependence chain is based on a single iteration of the while loop, where nodes 1 and 2 correspond to two dependent missing loads from ‘node->basic_arc->cost’. Nodes 3 and 4 correspond to ‘node->pred->potential’. Node 5 corresponds to ‘node->child’ and node 0 is the same load ‘node->child’ from the previous iteration. Figure 1(b) shows the dependence chain when the loop is unrolled multiple times. The solid arrows in Figure 1 represent the true data dependences and the dashed arrows represent the alias dependences between missing loads.
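The serialization described above can be made concrete with a small, self-contained variant of the loop. This is only a sketch: the types `arc_t` and `node_t` and the function `refresh_chain` are hypothetical, simplified stand-ins introduced here for illustration and are not the mcf sources. The comments mark which loads correspond to the nodes of the dependence chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the mcf data structures. */
enum { DOWN = 0, UP = 1 };

typedef struct arc {
    long cost;
} arc_t;

typedef struct node {
    int orientation;
    long potential;
    arc_t *basic_arc;
    struct node *pred;
    struct node *child;
} node_t;

/* Walks the child chain as in the excerpt and returns the number of
 * DOWN nodes visited (the "checksum" of the original loop). */
long refresh_chain(node_t *node)
{
    long checksum = 0;

    while (node) {
        assert(node->basic_arc != NULL && node->pred != NULL);

        /* Nodes 1 and 2: the address of ->cost is the value
         * returned by the load of node->basic_arc. */
        long cost = node->basic_arc->cost;

        /* Nodes 3 and 4: same pattern through node->pred. */
        long pred_potential = node->pred->potential;

        if (node->orientation == UP) {
            node->potential = cost + pred_potential;
        } else {
            node->potential = pred_potential - cost;
            checksum++;
        }

        /* Node 5: no load of the next iteration can compute its
         * address before this load returns, so misses serialize. */
        node = node->child;
    }
    return checksum;
}
```

If every one of these loads misses in the L2 cache, each iteration pays at least two full memory latencies back to back (one for the pointer, one for the field behind it), which is the behavior the dependence chain in Figure 1 depicts.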

The Simulated Architecture and Results

Starting from these challenges we carried out several analyses based on the SimpleScalar 3.0 tool set [2]. The simulation infrastructure is designed to execute Alpha binaries and traces. The baseline cycle-accurate simulator was the Decoupled Kilo-Instruction Processor (DKIP) simulator developed by the UPC team for HPCA-2006 [7]. Figure 2 illustrates the architecture of DKIP. On top of this structure we developed our own modules in order to obtain different statistics related to critical load instructions (loads that miss in the L2 data cache and have reached the head of the ROB). The workbench consists of some compute-intensive integer benchmarks (bzip2, gcc, gzip) and some memory-intensive integer benchmarks (mcf, parser, twolf and vpr), selected from SPEC2000 [8]. We ran the simulators on SUN machines with SPARC processors under UNIX operating systems.

Figure 2. The Decoupled Kilo-Instruction Processor. (Block diagram; labeled blocks: Cache Processor (out-of-order), Address Processor and Memory Processor (in-order); decode and rename stages, integer queues, ALUs, reorder buffer (RB), checkpointing stack, low locality instruction buffer, load queue, store queue and store buffer.)

We first determined the number of instructions decoded between two dispatched load instructions (loads are moved from the Cache Processor during the writeback stage into the LQ/SQ execution units of the Address Processor or into the LLIB) [7]. The simulation results show that, on average, two loads follow one immediately after another in 27.28% of cases. Operations on dynamic data structures, found in applications that use linked lists, trees or hash tables, are responsible for this result.

Next, we gathered statistics on pairs of consecutive critical loads, that is, loads that miss in the L2 data cache and are at the head of the ROB. The critical loads could be used for a more selective checkpointing mechanism. We determined the number of instructions committed between two such consecutive memory accesses. We found that 29.13% of critical loads occur at a distance of at most 2 instructions (see Figure 4). We also determined how many critical load instructions do not overlap their access to main memory. On average, 10.91% of critical loads are isolated or at the front of a group of loads (see Figure 5).
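A minimal sketch of how such a distance statistic could be collected from a commit trace is given below. It is not the module used in the DKIP simulator. The trace format (one flag per committed instruction), the names `critical_distance_histogram` and `share_within`, and the convention that a distance of 1 means two critical loads committed back to back are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIST 10 /* fine-grained bins dist=1..MAX_DIST */

/* is_critical[i] != 0 means the i-th committed instruction is a
 * critical load. hist[d] counts consecutive critical-load pairs at
 * distance d (1 <= d <= MAX_DIST); hist[0] collects all larger
 * distances. Returns the number of pairs found. */
size_t critical_distance_histogram(const unsigned char *is_critical,
                                   size_t n, size_t hist[MAX_DIST + 1])
{
    size_t pairs = 0;
    size_t prev = 0;
    int have_prev = 0;

    assert(is_critical != NULL || n == 0);
    for (size_t d = 0; d <= MAX_DIST; d++)
        hist[d] = 0;

    for (size_t i = 0; i < n; i++) {
        if (!is_critical[i])
            continue;
        if (have_prev) {
            size_t dist = i - prev; /* assumed distance convention */
            hist[dist <= MAX_DIST ? dist : 0]++;
            pairs++;
        }
        prev = i;
        have_prev = 1;
    }
    return pairs;
}

/* Percentage of pairs whose distance is at most `limit`. */
double share_within(const size_t hist[MAX_DIST + 1], size_t pairs,
                    size_t limit)
{
    size_t close = 0;

    assert(limit <= MAX_DIST);
    if (pairs == 0)
        return 0.0;
    for (size_t d = 1; d <= limit; d++)
        close += hist[d];
    return 100.0 * (double)close / (double)pairs;
}
```

Under this reading, the 29.13% reported above appears to be the sum of the two labeled bars in Figure 4 (19.16% + 9.97% = 29.13%), which is what `share_within(hist, pairs, 2)` would compute on the full trace.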

Figure 3. Average percentage of consecutive critical loads occurring at different distances. Coarse-grained statistics on SPEC2000 regarding the number of instructions that are committed between consecutive critical loads. (Bar chart; x-axis: Distance, bins dist->10 to dist->200; y-axis: Critical Loads [%]; labeled value: 56.09%.)

Figure 4. Average percentage of consecutive critical loads occurring at different distances. Fine-grained statistics on SPEC2000 regarding the number of instructions that are committed between consecutive critical loads. (Bar chart; x-axis: Distance, dist=1 to dist=10; y-axis: Critical Loads [%]; labeled values: 19.16%, 9.97%.)

Figure 5. Percentage of critical loads without overlapping. Statistics regarding overlapped accesses to main memory of two critical loads. (Bar chart; x-axis: SPEC2000 CPUInt benchmarks bzip2, gcc, gzip, mcf, parser, twolf, vpr; y-axis: Critical Loads [%]; series: mindist=lat, mindist=3/4lat, mindist=1/2lat; labeled values: 10.91%, 13.21%.)
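One possible way to count critical loads "without overlapping" is sketched below. This is our reading of the mindist parameter in Figure 5, not a description of the simulator module: a critical load is counted when the previous critical load started its memory access at least `mindist` cycles earlier, so that it is either isolated or the first of a group. The function name `count_non_overlapped` and the cycle-stamped input array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* issue_cycle[i] is the cycle in which the i-th critical load starts
 * its main memory access, in non-decreasing order. A load is counted
 * when no earlier critical load started within the last `mindist`
 * cycles (assumed model). The first load of the trace has no
 * predecessor and is always counted. */
size_t count_non_overlapped(const unsigned long *issue_cycle, size_t n,
                            unsigned long mindist)
{
    size_t count = 0;

    assert(issue_cycle != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0) {
            count++;
            continue;
        }
        assert(issue_cycle[i] >= issue_cycle[i - 1]);
        if (issue_cycle[i] - issue_cycle[i - 1] >= mindist)
            count++;
    }
    return count;
}
```

Calling the function with `mindist` set to the full memory latency, three quarters of it and half of it would give the three series of Figure 5 under this model. A smaller `mindist` can only increase the count, since every load that passes the stricter threshold also passes the looser one.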

In the next step we determined how many instructions are committed between a critical load and the first load that depends on it. The simulation results (see Figure 7) show that dependent loads frequently follow immediately after a critical load: the first dependent load is the first instruction after the critical load in 10.46% of cases and the second in 11.22% of cases.
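The measurement can be illustrated on a simplified trace. The sketch below assumes that "dependent" means a direct register dependence: a later load whose address register is the destination register of the critical load, with no intervening redefinition of that register. The record type `insn_t` and the function `first_dependent_load_distance` are hypothetical, and the simulator may well track dependences differently (for example, transitively through arithmetic instructions).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trace record, one per committed instruction. */
typedef struct {
    int dest;    /* destination register, -1 if none */
    int src;     /* source (address) register, -1 if none */
    int is_load; /* non-zero for load instructions */
} insn_t;

/* Returns the number of instructions from the critical load at index
 * `crit` to the first load that uses its result as an address, or 0
 * if the result is overwritten or the trace ends first. */
size_t first_dependent_load_distance(const insn_t *trace, size_t n,
                                     size_t crit)
{
    assert(trace != NULL && crit < n);
    assert(trace[crit].is_load && trace[crit].dest >= 0);

    int reg = trace[crit].dest;
    for (size_t i = crit + 1; i < n; i++) {
        /* Check the use first: a load such as r1 <- [r1] both
         * depends on and overwrites the register. */
        if (trace[i].is_load && trace[i].src == reg)
            return i - crit;
        if (trace[i].dest == reg)
            return 0; /* value is dead before any load used it */
    }
    return 0;
}
```

A distance of 1 corresponds to the pointer-chasing pattern of the mcf loop, where the loaded pointer is dereferenced by the very next instruction.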

Figure 6. Average percentage of loads that depend on critical loads. Coarse-grained statistics regarding the number of instructions that are committed between a critical load and the first dependent load. (Bar chart; bins dist->10 to dist->100; labeled value: 57.66%.)

Figure 7. Average percentage of loads that depend on critical loads. Fine-grained statistics regarding the number of instructions that are committed between a critical load and the first dependent load. (Bar chart; x-axis: Distance, dist=1 to dist=10; y-axis: Dependent Loads [%]; labeled values: 10.46%, 11.22%.)

Further, we computed the percentage of committed critical load instructions and the percentage of dependent loads. The frequency of critical loads varies between 3.46% (bzip2) and 45.54% (mcf, which is natural given the dependence chains in Figure 1), with an average of 17.73%. The highest percentages of dependent loads, on mcf and vpr, show that these benchmarks are characterized by heavy pointer chasing, and their overall performance is quite limited [5]. It can also be observed that, on average, about one in six load instructions (≈16.93%) depends on a critical load. In the case of the mcf benchmark, at least 11.90% of dependent loads are also critical. Except for the gzip and vpr benchmarks, the percentage of dependent loads is greater than that of critical loads. Since every dependent load is attached to only one critical load, it is likely that more than one load instruction depends on the same critical load. It is also possible that some of the dependent loads are themselves critical. However, the remaining dependent loads hit in the cache hierarchy (L1 or L2) and would execute quickly if the critical loads (address and/or value) were correctly predicted. Mutlu et al. [6] propose a solution for this kind of situation. By exploiting regular memory allocation patterns, their address-value delta (AVD) prediction technique enables the pre-execution of dependent instructions, including load instructions that incur long-latency cache misses.
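The idea behind AVD prediction [6] can be sketched in a few lines: for a pointer load, the predictor tracks the difference (delta) between the effective address and the loaded value, and once that delta is stable it predicts the value as the address minus the delta. The code below is only an illustration of this principle. The table organization, the confidence counter, the threshold and all names (`avd_t`, `avd_train`, `avd_predict`) are assumptions made here, not the design evaluated in [6].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AVD_ENTRIES 16       /* assumed small direct-mapped table */
#define AVD_CONF_MAX 3       /* assumed saturating counter limit */
#define AVD_CONF_THRESHOLD 2 /* assumed confidence needed to predict */

typedef struct {
    uint64_t pc;    /* address of the load instruction */
    uint64_t delta; /* effective address - loaded value (mod 2^64) */
    unsigned conf;  /* how often the same delta was seen again */
    int valid;
} avd_entry_t;

typedef struct {
    avd_entry_t e[AVD_ENTRIES];
} avd_t;

static size_t avd_index(uint64_t pc)
{
    return (size_t)((pc >> 2) % AVD_ENTRIES);
}

void avd_init(avd_t *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < AVD_ENTRIES; i++)
        p->e[i].valid = 0;
}

/* Called when a load completes with a known address and value. */
void avd_train(avd_t *p, uint64_t pc, uint64_t addr, uint64_t value)
{
    avd_entry_t *e = &p->e[avd_index(pc)];
    uint64_t delta = addr - value; /* unsigned wrap-around is defined */

    if (e->valid && e->pc == pc && e->delta == delta) {
        if (e->conf < AVD_CONF_MAX)
            e->conf++;
    } else {
        e->valid = 1;
        e->pc = pc;
        e->delta = delta;
        e->conf = 0;
    }
}

/* Returns 1 and writes the predicted value if the delta is stable. */
int avd_predict(const avd_t *p, uint64_t pc, uint64_t addr,
                uint64_t *value_out)
{
    const avd_entry_t *e = &p->e[avd_index(pc)];

    assert(value_out != NULL);
    if (!e->valid || e->pc != pc || e->conf < AVD_CONF_THRESHOLD)
        return 0;
    *value_out = addr - e->delta;
    return 1;
}
```

For a list whose nodes were allocated 0x40 bytes apart, a load of the next pointer at address A always returns A + 0x40, so the delta stays constant and the next node's address can be predicted while the critical load is still outstanding.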

Figure 8. Comparative statistics regarding committed critical loads and dependent loads. (Bar chart; x-axis: SPEC2000 CPUInt benchmarks bzip2, gcc, gzip, mcf, parser, twolf, vpr; y-axis: Loads [%]; series: Critical loads, Dependent loads; labeled values: 38.14, 17.73, 45.54, 16.93, 66.36.)

Future Work

In future work we will try to determine how many branches depend on a critical load. We want to find out whether these low-execution-locality branches [7] are unbiased [9] and how branch predictors dedicated to these branches can be modeled in KIP. Another direction is to study the feasibility of integrating a register-centered or instruction-centered value predictor into KIP.

Acknowledgment

The work has been performed under the Project HPC-EUROPA (RII3-CT-2003-506079), with the

support of the European Community – Research Infrastructure Action under the FP6 “Structuring the

European Research Area” Programme.

Publications

[1] Akkary H., Rajwar R., Srinivasan S.T., Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, ACM Press, 2003.

[2] Burger D., Austin T., The SimpleScalar tool set, version 2.0, Technical Report 1342, University of Wisconsin - Madison Computer Sciences Department, 1997.

[3] Cristal A., Santana O., Cazorla F., Galluzi M., Ramirez T., Pericas M., Valero M., Kilo-Instruction Processors: Overcoming the Memory Wall, IEEE Micro, Vol. 25, No. 3, 2005.

[4] Martinez J., Renau J., Huang M., Prvulovic M., Torrellas J., Cherry: Checkpointed early resource recycling in out-of-order microprocessors, Proceedings of the 35th International Symposium on Microarchitecture, 2002.

[5] Mutlu O., Stark J., Wilkerson C., Patt Y.N., Runahead execution: An alternative to very large instruction windows for out-of-order processors, Proceedings of the 9th International Symposium on High Performance Computer Architecture, 2003.

[6] Mutlu O., Kim H., Patt Y., Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns, Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, Barcelona, 2005.

[7] Pericas M., Gonzalez R., Cristal A., Jimenez D., Valero M., A Decoupled KILO-Instruction Processor, Proceedings of the 12th International Symposium on High Performance Computer Architecture, Austin, Texas, 2006.

[8] SPEC, The SPEC benchmark programs, http://www.spec.org.

[9] Vintan L., Gellert A., Florea A., Oancea M., Egan C., Understanding Prediction Limits Through Unbiased Branches, submitted to the 11th ACSAC2006, China, 2006.

[10] Zhou Y., Conte T., Enhancing Memory Level Parallelism via Recovery-Free Value Prediction, Proceedings of the 17th International Conference on Supercomputing, USA, 2003.
