Hitting the Memory Wall: Implications of the Obvious

Wm. A. Wulf
Sally A. McKee
Department of Computer Science
University of Virginia
{wulf | mckee}@virginia.edu
December 1994

Appeared in Computer Architecture News, 23(1):20-24, March 1995.
This brief note points out something obvious -- something the authors "knew" without really understanding. With apologies to those who did understand, we offer it to those others who, like us, missed the point.
We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed; each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. How big and how soon? The answers to these questions are what the authors had failed to appreciate.
To get a handle on the answers, consider an old friend -- the equation for the average time to access memory, where t_c and t_m are the cache and DRAM access times and p is the probability of a cache hit:

    t_avg = p × t_c + (1 - p) × t_m
We want to look at how the average access time changes with technology, so we'll make some conservative assumptions; as you'll see, the specific values won't change the basic conclusion of this note, namely that we are going to hit a wall in the improvement of system performance unless something basic changes.
First let's assume that the cache speed matches that of the processor, and specifically that it scales with the processor speed. This is certainly true for on-chip cache, and allows us to easily normalize all our results in terms of instruction cycle times (essentially saying t_c = 1 cpu cycle). Second, assume that the cache is perfect. That is, the cache never has a conflict or capacity miss; the only misses are the compulsory ones. Thus (1 - p) is just the probability of accessing a location that has never been referenced before (one can quibble and adjust this for line size, but this won't affect the conclusion, so we won't make the argument more complicated than necessary).
Now, although (1 - p) is small, it isn't zero. Therefore as t_c and t_m diverge, t_avg will grow and system performance will degrade. In fact, it will hit a wall.
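To make the arithmetic concrete, here is a minimal sketch (ours, in Python -- not part of the original note) of that formula, with t_c normalized to 1 cpu cycle as assumed above. The miss-cost values fed to it are illustrative only.

```python
# Minimal sketch of t_avg = p * t_c + (1 - p) * t_m, with t_c normalized
# to one cpu cycle. The t_m values below are illustrative, not measured.

def t_avg(p, t_m, t_c=1.0):
    """Average memory access time, in cpu cycles."""
    return p * t_c + (1.0 - p) * t_m

p = 0.99  # "perfect" cache: only compulsory misses
for t_m in (4, 16, 64, 256, 1024):
    print(f"t_m = {t_m:4d} cycles  ->  t_avg = {t_avg(p, t_m):6.2f} cycles")
```

Even at a 99% hit rate, t_avg passes the 5-cycle threshold discussed next once a miss costs roughly 400 cpu cycles.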
In most programs, 20-40% of the instructions reference memory [Hen90]. For the sake of argument let's take the lower number, 20%. That means that, on average, during execution every 5th instruction references memory. We will hit the wall when t_avg exceeds 5 instruction times. At that point system performance is totally determined by memory speed; making the processor faster won't affect the wall-clock time to complete an application.
Alas, there is no easy way out of this. We have already assumed a perfect cache, so a bigger/smarter one won't help. We're already using the full bandwidth of the memory, so prefetching or other related schemes won't help either. We can consider other things that might be done, but first let's speculate on when we might hit the wall.
Assume the compulsory miss rate is 1% or less [Hen90] and that the next level of the
memory hierarchy is currently four times slower than cache. If we assume that DRAM
speeds increase by 7% per year [Hen90] and use Baskett's estimate that microprocessor
performance is increasing at the rate of 80% per year [Bas91], the average number of cycles
per memory access will be 1.52 in 2000, 8.25 in 2005, and 98.8 in 2010. Under these
assumptions, the wall is less than a decade away.
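The sketch below (ours, not the authors') reproduces that projection; it assumes the 4-cycle miss cost and 99% hit rate hold in 1995 and compounds the two growth rates from there.

```python
# Reproduces the projection in the text: t_c = 1 cycle, hit rate p = 0.99,
# miss cost of 4 cycles in the base year (assumed here to be 1995),
# cpu performance growing 80%/year and DRAM performance 7%/year.

CPU_GROWTH, DRAM_GROWTH = 1.80, 1.07
P_HIT, T_M_BASE, BASE_YEAR = 0.99, 4.0, 1995

def cycles_per_access(year):
    # The miss penalty, measured in cpu cycles, grows each year by the
    # ratio of the two improvement rates.
    t_m = T_M_BASE * (CPU_GROWTH / DRAM_GROWTH) ** (year - BASE_YEAR)
    return P_HIT * 1.0 + (1.0 - P_HIT) * t_m

for year in (2000, 2005, 2010):
    print(year, round(cycles_per_access(year), 2))
# Prints roughly 1.53, 8.25, and 98.8 -- matching the figures above
# up to rounding.
```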
Figures 1-3 explore various possibilities, showing projected trends for a set of perfect or near-perfect caches. All our graphs assume that DRAM performance continues to increase at an annual rate of 7%. The horizontal axis is various cpu/DRAM performance ratios, and the lines at the top indicate the dates these ratios occur if microprocessor performance (μ) increases at rates of 50% and 100% respectively. Figure 1 assumes that cache misses are currently 4 times slower than hits; Figure 1(a) considers compulsory cache miss rates of less than 1% while Figure 1(b) shows the same trends for caches with more realistic miss rates of 2-10%. Figure 2 is a counterpart of Figure 1, but assumes that the current cost of a cache miss is 16 times that of a hit.
Figure 3 provides a closer look at the expected impact on average memory access time for one particular value of μ, Baskett's estimated 80%. Even if we assume a cache hit rate of 99.8% and use the more conservative cache miss cost of 4 cycles as our starting point, performance hits the 5-cycles-per-access wall in 11-12 years. At a hit rate of 99% we hit the same wall within the decade, and at 90%, within 5 years.
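As a rough check on those horizons, the sketch below (again ours, under the same assumed 1995 starting point) solves for the number of years until t_avg reaches 5 cycles at each hit rate.

```python
# Years until t_avg reaches the 5-cycle "wall", assuming an 80%/year cpu
# and 7%/year DRAM improvement and a 4-cycle miss cost at the start.

import math

CPU_GROWTH, DRAM_GROWTH, T_M_BASE, WALL = 1.80, 1.07, 4.0, 5.0
ratio = CPU_GROWTH / DRAM_GROWTH  # yearly growth of the miss cost in cycles

for p in (0.998, 0.99, 0.90):
    # Solve p + (1 - p) * T_M_BASE * ratio**n = WALL for n.
    n = math.log((WALL - p) / ((1.0 - p) * T_M_BASE)) / math.log(ratio)
    print(f"hit rate {p:.1%}: wall in about {n:.1f} years")
# Roughly 12, 9, and 4.5 years -- consistent with the estimates above.
```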
Note that changing the starting point -- the "current" miss/hit cost ratio -- and the cache miss rates doesn't change the trends: if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles. Under each scenario, system speed is dominated by memory performance.
Over the past thirty years there have been several predictions of the imminent cessation of the rate of improvement in computer performance. Every such prediction was wrong. They were wrong because they hinged on unstated assumptions that were overturned by subsequent events. So, for example, the failure to foresee the move from discrete components to integrated circuits led to a prediction that the speed of light would limit computer speeds to several orders of magnitude slower than they are now.
Our prediction of the memory wall is probably wrong too -- but it suggests that we have to start thinking "out of the box". All the techniques that the authors are aware of, including ones we have proposed [McK94, McK94a], provide one-time boosts to either bandwidth or latency. While these delay the date of impact, they don't change the fundamentals.
The most "convenient" resolution to the problem would be the discovery of a cool, dense
memory technology whose speed scales with that of processors. We ax~ not aware of any
such technology and could not affect its development in any case; the only contribution we
can make is to look for architectural solutions. These are probably all bogus, but the
discussion must start somewhere:
• Can we drive the number of compulsory misses to zero? If we can't fix t_m, then the only way to make caches work is to drive p to 100% -- which means eliminating the compulsory misses. If all data were initialized dynamically, for example, possibly the compiler could generate special "first write" instructions. It is harder for us to imagine how to drive the compulsory misses for code to zero.

• Is it time to forgo the model that access time is uniform to all parts of the address space? It is false for DSM and other scalable multiprocessor schemes, so why not for single processors as well? If we do this, can the compiler explicitly manage a smaller amount of higher speed memory?

• Are there any new ideas for how to trade computation for storage? Alternatively, can we trade space for speed? DRAM keeps giving us plenty of the former.

• Ancient machines like the IBM 650 and Burroughs 205 used magnetic drum as primary memory and had clever schemes for reducing rotational latency to essentially zero -- can we borrow a page from either of those books?
As noted above, the right solution to the problem of the memory wall is probably something that we haven't thought of -- but we would like to see the discussion engaged. It would appear that we do not have a great deal of time.
Figure 1: Trends for a Current Cache Miss/Hit Cost Ratio of 4. (Panels (a) and (b): average access cost vs. cache miss/hit cost ratio, for several hit rates p.)
Figure 2: Trends for a Current Cache Miss/Hit Cost Ratio of 16. (Panels (a) and (b): average access cost vs. cache miss/hit cost ratio, for several hit rates p.)
Figure 3: Average Access Cost for 80% Annual Increase in Processor Performance. (Average access cost vs. year, for several hit rates p.)
References

[Bas91] F. Baskett, Keynote address, International Symposium on Shared Memory Multiprocessing, April 1991.

[Hen90] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990.

[McK94] S. A. McKee et al., "Experimental Implementation of Dynamic Access Ordering", Proc. 27th Hawaii International Conference on System Sciences, Maui, HI, January 1994.

[McK94a] S. A. McKee et al., "Increasing Memory Bandwidth for Vector Computations", Proc. Conference on Programming Languages and System Architecture, Zurich, March 1994.