MEMORY WALL – A CRITICAL FACTOR IN CURRENT HIGH-PERFORMANCE MICROPROCESSORS
Adrian Florea, Arpad Gellert
“Lucian Blaga” University of Sibiu, Computer Science Department,
Emil Cioran Street, No. 4, 550025 Sibiu, Romania, {adrian.florea, arpad.gellert}@ulbsibiu.ro
Abstract
The memory access time is a critical factor that limits the performance of current microprocessors.
Critical high-latency instructions caused by cache misses can slow a processor well below its peak potential.
Pointer chasing in applications is responsible for reducing overall processor performance because it
serializes the execution of long-latency memory operations, inhibiting their overlap. In this paper we present
statistics on the SPEC2000 benchmarks related to critical load instructions (loads that miss in the L2 data
cache and reach the head of the ROB). We found that 29.13% of critical loads appear at a distance of at most
2 instructions. We also determined how many critical load instructions do not overlap their accesses to main
memory: on average, 10.91% of critical loads are isolated or at the front of a group of loads.
Different Approaches to Overcoming the Memory Wall
In the last decade, processor cycle time has been decreasing at a faster rate than memory access
time. In addition, the architectural design of processors has improved with the development of deeply
pipelined, multiple-instruction-issue architectures. These factors have widened the technological gap –
named the memory wall [3] – between processor speed and the speed of the underlying memory hierarchy.
Thus, in order to improve the performance of memory-intensive application programs, innovative techniques
are needed to tolerate long-latency main-memory accesses.
Recently, several microarchitectures have been proposed to overcome the memory wall by
introducing techniques for the efficient management of architectural resources (the Reorder Buffer – ROB,
register files, instruction queues and load/store queues). Akkary et al. [1] introduced the Checkpoint
Processing and Recovery microarchitecture, which is ROB-free and requires only a small number of rename-map
checkpoints, selectively created at low-confidence branches, while supporting a large instruction
window on the order of thousands of instructions. In [5] the authors propose runahead execution as an
effective way to increase memory latency tolerance in an out-of-order processor without requiring an
unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long-latency
operations, allowing the processor to execute far ahead on the program path. In [4], Checkpointed
Early Resource Recycling (Cherry) is proposed: a hybrid mode of execution based on the ROB and
checkpointing that decouples resource recycling from instruction retirement. This approach uses a single
checkpoint outside the ROB to divide it into two regions: a speculative region and a non-speculative region.
Cherry is then able to release early the physical registers and LSQ entries of instructions in the
non-speculative ROB section.
Cristal et al. [3] proposed the Kilo-Instruction Processor (KIP) – a solution to overcome the
increasing gap between processor performance and memory speed. KIP is a new type of out-of-order
superscalar processor that overlaps long memory access delays by maintaining thousands of in-flight
instructions. The traditional approach of scaling up critical processor structures to provide such support is
impractical at these levels, due to area, power, and cycle-time constraints. To overcome this resource-scalability
problem, the authors proposed a smarter use of the available resources, supported by a selective
checkpointing mechanism. This mechanism allows instructions to commit out of order and makes a reorder
buffer unnecessary. KIP is based on a set of techniques such as multilevel instruction queues, late allocation
and early release of registers, and early release of load/store queue entries.
However, the KIP performance achieved on integer applications is sometimes limited by pointer chasing
and hard-to-predict branches. Pointer chasing is even more harmful than hard-to-predict branches, because it
serializes the execution of long-latency memory operations, inhibiting their overlap. To address this
problem, value prediction techniques might be useful for predicting the addresses along a pointer chain,
thereby allowing these accesses to overlap. For memory-intensive workloads with heavy pointer chasing,
sequential cache misses resulting from pointer-chasing code structures dominate the overall execution time.
These cache misses form a memory dependence chain, since the address of a missing load depends on
the value of the previous missing load. Taking as an example the frequently executed code segment from
the refresh_potential function (see below) belonging to the mcf benchmark, the profile information shows that the
pointer-chasing expressions ‘node->child’, ‘node->basic_arc->cost’ and ‘node->pred->potential’ result in many
cache misses [10].
while( node )
{
    if( node->orientation == UP )
        node->potential = node->basic_arc->cost + node->pred->potential; // (Nodes 1,2,3,4)
    else /* == DOWN */
    {
        node->potential = node->pred->potential - node->basic_arc->cost;
        checksum++;
    }
    tmp = node;
    node = node->child; // (Nodes 0, 5)
}
The memory dependence chains formed by these missing loads are shown below (see Figure 1).
Figure 1. The memory dependence chain based on the above code.
(a) The dependence chain for a single iteration.
(b) The dependence chain for multiple iterations.
In Figure 1(a), the dependence chain is based on a single iteration of the while loop, where nodes 1
and 2 correspond to two dependent missing loads from ‘node->basic_arc->cost’. Nodes 3 and 4 correspond
to ‘node->pred->potential’. Node 5 corresponds to ‘node->child’, and node 0 is the same load ‘node->child’
from the previous iteration. Figure 1(b) shows the dependence chain when the loop is unrolled multiple
times. The solid arrows in Figure 1 represent true data dependences, and the dashed arrows represent the
alias dependences between missing loads.
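To make the idea of predicting addresses along a pointer chain concrete, the following is a minimal sketch of our own (not code from the cited works): if consecutive nodes happen to be allocated at a roughly constant stride in memory – the kind of regular allocation pattern exploited in [6] – the next node's address can be guessed and prefetched while the current node is still being processed. The sketch uses GCC's __builtin_prefetch, which issues a non-binding prefetch, so a wrong guess costs no correctness; a hardware predictor would be driven by confidence counters rather than an unconditional guess.

/* Sketch (illustrative): overlapping the serializing node->child
 * misses by prefetching a predicted next-node address. Assumes the
 * nodes were allocated at a roughly constant stride in memory. */
#include <stddef.h>

typedef struct node node_t;
struct node {
    long    potential;
    node_t *child;
    /* basic_arc, pred, orientation ... omitted for brevity */
};

void walk(node_t *node)
{
    node_t   *prev   = NULL;
    ptrdiff_t stride = 0;           /* last observed node-to-node delta */

    while (node) {
        if (prev)
            stride = (char *)node - (char *)prev;
        if (stride)                 /* predicted address of the next node */
            __builtin_prefetch((char *)node + stride);

        /* ... loop body: update node->potential as in the code above ... */

        prev = node;
        node = node->child;         /* the serializing load */
    }
}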
The Simulated Architecture and Results
Starting from these challenges, we performed several analyses based on the SimpleScalar 3.0 tool set [2]. The
simulated infrastructure is designed to execute Alpha binaries and traces. The baseline cycle-accurate
simulator used was the Decoupled Kilo-Instruction Processor (DKIP) simulator developed by the UPC team for
HPCA-2006 [7]. Figure 2 illustrates the architecture of DKIP. On top of this structure we developed our own
modules in order to obtain different statistics related to critical load instructions (loads that miss in
the L2 data cache and reach the head of the ROB). The workbench consists of compute-intensive
integer benchmarks – bzip2, gcc, gzip – and memory-intensive integer benchmarks – mcf, parser, twolf
and vpr – selected from SPEC2000 [8]. The simulators were run on SUN machines with SPARC processors
under UNIX operating systems.
[Figure 2 diagram: the DKIP consists of a Cache Processor (out-of-order), an Address Processor, and a Memory Processor (in-order); the pictured components include the Decode & Rename stage, integer and FP queues, register files, ALUs, the reorder buffer (RB), a checkpointing stack, a low-locality instruction buffer with a low-locality register file, and the load queue, store queue and store buffer.]
Figure 2. The Decoupled Kilo-Instruction Processor.
We first determined the number of instructions decoded between two dispatched load instructions
(loads are moved from the Cache Processor, during the writeback stage, into the LQ/SQ execution units
of the Address Processor or into the LLIB) [7]. The simulation results show that, on average, two loads
follow one directly after the other in 27.28% of cases. Operations on the dynamic data structures
encountered in applications using linked lists, trees or hash tables are responsible for this result.
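Such a statistic is straightforward to collect; the following is a minimal sketch of the kind of counting module we attached to the decode stage. The hook name and the histogram bound are our own illustration, not part of the SimpleScalar/DKIP API.

/* Sketch of the decode-stage counting module: histogram of the
 * distance, in decoded instructions, between consecutive loads.
 * count_decoded_inst() is an illustrative hook, not a SimpleScalar
 * API function. */
#define MAX_DIST 200

static unsigned long load_dist_hist[MAX_DIST + 1];
static long decode_count  = 0;    /* instructions decoded so far       */
static long last_load_pos = -1;   /* decode index of the previous load */

void count_decoded_inst(int is_load)
{
    decode_count++;
    if (!is_load)
        return;
    if (last_load_pos >= 0) {
        long d = decode_count - last_load_pos;
        load_dist_hist[d < MAX_DIST ? d : MAX_DIST]++;
    }
    last_load_pos = decode_count;
}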
Next, we gathered statistics on pairs of load instructions that miss in the L2 data cache and reach
the head of the ROB (critical loads). Such critical loads could drive a more selective checkpointing
mechanism. We determined the number of instructions that commit between two such consecutive memory
accesses, and found that 29.13% of critical loads appear at a distance of at most 2
instructions (see Figure 4). We also determined how many critical load instructions do not overlap their
accesses to main memory: on average, 10.91% of critical loads are isolated or at the front of a
group of loads (see Figure 5).
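Our interpretation of “non-overlapped” follows the legend of Figure 5: a critical load is counted as isolated, or at the front of a group, when no other load started its main-memory access within mindist cycles before it (with mindist set to lat, 3/4·lat or 1/2·lat, where lat is the main-memory latency). A minimal sketch of this test, with illustrative names, is given below.

/* Sketch of the overlap test, applied when a critical load starts
 * its main-memory access. All names are illustrative. */
static long last_access_cycle = -1;       /* start cycle of the previous
                                             load's main-memory access */
static unsigned long isolated_critical = 0;

void on_critical_load_access(long cycle, long mindist)
{
    /* mindist = lat, 3*lat/4 or lat/2, where lat is the memory latency */
    if (last_access_cycle < 0 || cycle - last_access_cycle >= mindist)
        isolated_critical++;    /* no earlier access hides its latency */
    last_access_cycle = cycle;
}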
[Figure 3 chart: x-axis – Distance (dist->10 through dist->200, in steps of 10); y-axis – Critical Loads [%]; the highest labeled bar reads 56.09%.]
Figure 3. Average percentage of consecutive critical loads occurring at different distances. Coarse-grained statistics on
SPEC2000 regarding the number of instructions that are committed between consecutive critical loads.
[Figure 4 chart: x-axis – Distance (dist=1 through dist=10); y-axis – Critical Loads [%]; labeled bars read 19.16% and 9.97%.]
Figure 4. Average percentage of consecutive critical loads occurring at different distances. Fine-grained statistics on
SPEC2000 regarding the number of instructions that are committed between consecutive critical loads.
[Figure 5 chart: x-axis – SPEC2000 CPUInt benchmarks (bzip2, gcc, gzip, mcf, parser, twolf, vpr, AM); y-axis – Critical Loads [%]; three series: mindist=lat, mindist=3/4lat, mindist=1/2lat; labeled values read 10.91% and 13.21%.]
Figure 5. Percentage of critical loads without overlapping. Statistics regarding the overlapped main-memory accesses of
two critical loads.
In the next step we determined how many instructions are committed between a critical load and the
first dependent load instruction. The simulation results (see Figure 7) show that dependent loads
frequently follow immediately after a critical load: the dependent instruction is the first after a
critical load in 10.46% of cases and the second in 11.22% of cases.
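A simple way to obtain this statistic is to let each architectural register remember the critical load that last produced it and to check, at the commit of every load, whether its source register carries such a mark. The sketch below tracks only direct, single-source dependences and uses illustrative names; a real module must also handle multiple source registers.

/* Sketch: distance (in committed instructions) from a critical load
 * to the first load depending on it. Names are illustrative. */
#define NUM_REGS 64                       /* Alpha: 32 integer + 32 FP */

static long critical_producer[NUM_REGS];  /* commit seq of the critical load
                                             producing the register, or -1;
                                             initialized to -1 at start-up */
static unsigned long dep_dist_hist[11];   /* bins 1..9; bin 10 collects
                                             all larger distances         */

void on_commit(long seq, int is_load, int is_critical,
               int src_reg, int dst_reg)
{
    if (is_load && src_reg >= 0 && critical_producer[src_reg] >= 0) {
        long d = seq - critical_producer[src_reg];
        dep_dist_hist[d < 10 ? d : 10]++;
        critical_producer[src_reg] = -1;   /* count only the first one */
    }
    if (dst_reg >= 0)                      /* overwriting kills the mark */
        critical_producer[dst_reg] = (is_load && is_critical) ? seq : -1;
}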
[Figure 6 chart: distances dist->10 through dist->100; the highest labeled bar reads 57.66%.]
Figure 6. Average percentage of loads that depend on critical loads. Coarse-grained statistics regarding the number of
instructions that are committed between a critical load and the first dependent load.
[Figure 7 chart: x-axis – Distance (dist=1 through dist=10); y-axis – Dependent Loads [%]; labeled bars read 10.46% (dist=1) and 11.22% (dist=2).]
Figure 7. Average percentage of loads that depend on critical loads. Fine-grained statistics regarding the number of
instructions that are committed between a critical load and the first dependent load.
Further, we computed the percentage of committed critical load instructions and also the percentage of
dependent loads. The frequency of critical loads varies between 3.46% (bzip2) and 45.54% (mcf – a natural
result, considering Figure 1), with an average of 17.73%. The high percentage of dependent loads in
mcf and vpr proves that these benchmarks are characterized by heavy pointer chasing, and also that their
overall performance is quite limited [5]. It can also be observed that, on average, one in six (16.93%) load
instructions depends on a critical load. In the case of the mcf benchmark, at least 11.90% of the dependent
loads are themselves critical. Except for the gzip and vpr benchmarks, the percentage of dependent loads is
greater than that of critical loads. Since every dependent load is associated with only one critical load, it is
likely that more than one load instruction depends on the same critical instruction. It is also possible that
some of the dependent loads are themselves critical. However, the remaining dependent loads hit in the cache
hierarchy (L1 or L2) and would be executed quickly if the critical loads (their address and/or value) were
correctly predicted. Mutlu et al. [6] propose a solution for this kind of situation: exploiting regular memory
allocation patterns, their value- and address-based prediction technique enables the pre-execution of
dependent instructions, including load instructions that incur long-latency cache misses.
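The core of the Address-Value Delta (AVD) mechanism can be sketched as follows: for each pointer load, the predictor tracks the delta between the load's effective address and the loaded value; when this delta is stable, the value of a long-latency miss is predicted as the address minus the delta, which allows dependent loads to be pre-executed. The table size and confidence threshold below are illustrative choices, not taken from [6]. Since such a prediction is used only to pre-execute dependent instructions, a misprediction can simply be discarded without architectural recovery.

/* Minimal sketch of an AVD predictor in the spirit of [6]; the table
 * size and the confidence threshold are illustrative choices. */
#include <stdint.h>

#define AVD_ENTRIES  16
#define AVD_CONF_MAX 2            /* delta must repeat before being used */

typedef struct {
    uint64_t pc;                  /* static load identified by its PC    */
    int64_t  delta;               /* effective address - loaded value    */
    int      conf;                /* saturating confidence counter       */
} avd_entry_t;

static avd_entry_t avd[AVD_ENTRIES];

/* Train the predictor on every completed pointer load. */
void avd_update(uint64_t pc, uint64_t addr, uint64_t value)
{
    avd_entry_t *e = &avd[pc % AVD_ENTRIES];
    int64_t d = (int64_t)(addr - value);

    if (e->pc == pc && e->delta == d) {
        if (e->conf < AVD_CONF_MAX)
            e->conf++;
    } else {                      /* allocate / retrain the entry */
        e->pc = pc;
        e->delta = d;
        e->conf = 0;
    }
}

/* On a long-latency miss: returns 1 and a predicted value if the
 * delta is stable, so dependent loads can be pre-executed early. */
int avd_predict(uint64_t pc, uint64_t addr, uint64_t *value)
{
    avd_entry_t *e = &avd[pc % AVD_ENTRIES];
    if (e->pc == pc && e->conf >= AVD_CONF_MAX) {
        *value = (uint64_t)((int64_t)addr - e->delta);
        return 1;
    }
    return 0;
}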
[Figure 8 chart: x-axis – SPEC2000 CPUInt benchmarks (bzip2, gcc, gzip, mcf, parser, twolf, vpr, AM); y-axis – Loads [%]; two series: Critical loads and Dependent loads; labeled values include 38.14, 17.73, 45.54, 16.93 and 66.36.]
Figure 8. Comparative statistics regarding committed critical loads and dependent loads.
Future Work
Further we’ll continue our research trying to determine how many branches depend on a critical load.
We want to find if these low execution locality branches [7] are unbiased [9] and how we can model branch
predictors dedicated for these branches in KIP. Another idea is based on the feasibility of register/instruction
centered value predictor integrated into KIP.
Acknowledgment
The work has been performed under the Project HPC-EUROPA (RII3-CT-2003-506079), with the
support of the European Community – Research Infrastructure Action under the FP6 “Structuring the
European Research Area” Programme.
Publications
[1] Akkary H., Rajwar R., Srinivasan S.T., Checkpoint Processing and Recovery: Towards Scalable Large
Instruction Window Processors, Proceedings of the 36th Annual IEEE/ACM International Symposium on
Microarchitecture, ACM Press, 2003.
[2] Burger D., Austin T., The SimpleScalar tool set, version 2.0, Technical Report 1342, University of Wisconsin -
Madison Computer Sciences Department, 1997.
[3] Cristal A., Santana O., Cazorla F., Galluzi M., Ramirez T., Pericas M., Valero M., Kilo-Instruction Processors:
Overcoming the Memory Wall, IEEE Micro, Vol. 25, No. 3, 2005.
[4] Martinez J., Renau J., Huang M., Prvulovic M., Torrellas J., Cherry: Checkpointed Early Resource Recycling
in Out-of-Order Microprocessors, Proceedings of the 35th International Symposium on Microarchitecture, 2002.
[5] Mutlu O., Stark J., Wilkerson C. and Patt Y.N., Runahead execution: An alternative to very large instruction
windows for out-of-order processors, In Proceedings of the 9th International Symposium on High Performance
Computer Architecture, 2003.
[6] Mutlu O., Kim H., Patt Y., Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead
Execution by Exploiting Regular Memory Allocation Patterns, Proceedings of the 38th Annual IEEE/ACM
International Symposium on Microarchitecture, Barcelona, 2005.
[7] Pericas M., Gonzalez R., Cristal A., Jimenez D., Valero M., A Decoupled KILO-Instruction Processor,
Proceedings of the 12th International Symposium on High Performance Computer Architecture, Austin, Texas,
2006.
[8] SPEC, The SPEC benchmark programs, http://www.spec.org.
[9] Vintan L., Gellert A., Florea A., Oancea M., Egan C., Understanding Prediction Limits Through Unbiased
Branches, submitted to the 11th ACSAC 2006, China, 2006.
[10] Zhou Y., Conte T., Enhancing Memory Level Parallelism via Recovery-Free Value Prediction, Proceedings of
the 17th International Conference on Supercomputing, USA, 2003.