Hi-End: Hierarchical, Endurance-Aware STT-MRAM-Based Register File for Energy-Efficient GPUs
WON JEON1, (Student Member, IEEE), JUN HYUN PARK1, YOONSOO KIM2, GUNJAE KOO3,
(Member, IEEE), AND WON WOO RO1, (Senior Member, IEEE)
1Electrical and Electronic Engineering Department, Yonsei University, Seoul, South Korea
2SSD System Engineering Team, NAND Solution Division, SK Hynix, Seongnam, South Korea
3Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Corresponding author: Won Woo Ro (e-mail: wro@yonsei.ac.kr).
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.
NRF-2018R1A2A2A05018941) and Institute of Information & Communication Technology Planning & Evaluation (IITP) grant funded by
the Korea government (MSIT) (No. 2019-0-00533, Research on CPU vulnerability detection and validation).
ABSTRACT Modern Graphics Processing Units (GPUs) require large hardware resources for massive
parallel thread executions. In particular, modern GPUs have a large register file composed of Static
Random Access Memory (SRAM). Due to the high leakage current of SRAM, the register file consumes
approximately 20% of the total GPU energy. The energy efficiency of the register file becomes more
critical as the throughput of GPUs increases. For more energy-efficient GPUs, the usage of non-volatile
memory such as Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) as the GPU
register file has been studied extensively. STT-MRAM requires a lower leakage current compared to
SRAM and provides an appropriate read performance. However, using STT-MRAM directly in the GPU
register file causes problems in performance and endurance because of complicated write procedures and
material characteristics. To overcome these challenges, we propose a novel register file architecture and its
management system for GPUs, named Hi-End, which exploits the data locality and compressibility of the
register file. For STT-MRAM-based GPU register files, Hi-End increases the data write performance and
endurance by caching and data compression, respectively. In our evaluation, Hi-End enhances the energy
efficiency of a GPU register file by 70.02% and reduces the write operations by up to 95.98% with negligible
performance degradation compared to SRAM-based register files.
INDEX TERMS Graphics processing unit, register file, spin-transfer torque magnetic random access
memory, data compression, energy efficiency, endurance, chip area.
I. INTRODUCTION
Graphics Processing Units (GPUs) have emerged as the most important computing platform since throughput-oriented applications, such as artificial intelligence workloads,
became popular. Many emerging applications, such as image
classification, audio synthesis, and recommender systems,
are accelerated using GPUs. Because modern GPUs are
deployed from mobile systems to high-performance data
centers, the power consumption of GPUs becomes a critical
issue. A GPU core, called a streaming multiprocessor (SM),
provides a large register file for massive thread-level paral-
lelism and fast context switching. For instance, the recent
NVIDIA Ampere architecture includes a 27,648 KB register
file whereas the previous Fermi architecture has a 2,048 KB
register file [1], [2]. To provide sufficient on-chip memory
performance, the register file in a GPU is implemented with
Static Random Access Memory (SRAM) which requires
a significant amount of power and area. Previous studies
revealed that the GPU register file consumes 15–20% of
the total energy consumption on NVIDIA GPU devices [3],
[4]. As a modern GPU provides a larger register file, the
power consumption and the area overhead by the register file
become more critical.
As Complementary Metal–Oxide–Semiconductor (CMOS)
scales down, the static power consumption by leakage
current occupies a considerable proportion of the entire
power budget. To reduce the static power of memory cells,
researchers have investigated the Non-Volatile Memory
(NVM) technologies such as Spin-Transfer Torque Magnetic
Random Access Memory (STT-MRAM). The STT-MRAM
cells provide similar read performance and a significantly
low leakage power compared to the conventional SRAM
cells. However, STT-MRAM cells suffer from inferior write performance and endurance. The long write latency
and the limited write endurance of STT-MRAM are critical
obstacles preventing it from directly substituting SRAM cells
despite its leakage power advantage. Previous studies pro-
posed architectural solutions to overcome these performance
hurdles of STT-MRAM [4]–[13]. For instance, SRAM-based
write buffers have been utilized to minimize the performance
degradation from STT-MRAM’s long write latency [4], [5].
However, despite their importance, none of the previous
studies have considered the low endurance issues of STT-
MRAM cells used in place of SRAM cells.
In this paper, we propose a novel STT-MRAM-based
register file architecture for GPUs, named Hierarchical,
Endurance-aware STT-MRAM-based register file (Hi-End).
Hi-End employs a hierarchical structure and a compression
technique to solve the problems of STT-MRAM. We de-
signed Hi-End based on two observations. The first obser-
vation is that most read/write accesses to the GPU register
file are concentrated to a small number of registers. Hence,
the GPU register file accesses exhibit a high locality. The
second observation is that most write accesses to the GPU
register file are compressible. Hence, the number of register file bank activations can be effectively reduced if the data are compressed.
To fully exploit the locality of the GPU register file, we
adopted a register cache and a delay buffer. The register
cache works as a write cache for the slow STT-MRAM
register file. The operation of the register cache is similar to
that of the conventional cache except it only caches write-
back data. When the cache block is evicted from the register
cache, the delay buffer operates as a station-like buffer that
maintains the evicted block until it is written to the STT-
MRAM register file. When evicted data are written to register
files, data are compressed to minimize the dynamic power
and increase the endurance of STT-MRAM. Since the data
are compressed, we can store the same data with a small
number of register banks. Hi-End utilizes unused register
banks to enhance the endurance of STT-MRAM cells. Parts
of the STT-MRAM register banks are accessed sequentially
to write data. Consequently, the number of write operations
for each STT-MRAM cell is decreased owing to the compres-
sion ratio; hence, endurance is improved accordingly.
The contributions of this paper are summarized as follows:
• By exploiting the locality of the GPU register file, we propose a register cache and a delay buffer architecture to minimize direct accesses to the STT-MRAM-based register file. Using the register cache and the delay buffer, we can reduce the STT-MRAM write overhead while retaining the advantages of the STT-MRAM register file.
FIGURE 1. Baseline GPU and register file architecture (SMs with I-cache/fetch/decode, warp scheduler, register file, CUDA cores, load/store unit, L1 data cache, and scratchpad memory; a shared L2 cache and memory controllers to off-chip DRAM; the register file comprises multiple banks behind a bank arbiter and operand collectors).
• We propose a compression technique that exploits the value similarity of the GPU register file and decreases the high dynamic power of the STT-MRAM register file. Furthermore, bank-level wear-leveling spreads out the otherwise concentrated bank accesses to increase endurance with a low-cost architecture.
• With cycle-accurate simulations, we show that Hi-End reduces the energy consumption of GPU register files by 70.02% with a 0.86% performance drop compared to GPUs with SRAM-based register files. Hi-End also reduces the number of write operations to the most frequently written register file bank by 95.98%.
The rest of this paper is organized as follows. Section II
presents the baseline GPU architecture, register file orga-
nization, and characteristics of STT-MRAM. Section III
introduces the motivational data and opportunities for our
study. In Section IV, we present the detailed architecture and
operation of Hi-End. The experimental results of Hi-End,
such as the performance and energy efficiency, are described
in Section V. In Section VI, we introduce related studies, and
we conclude our paper in Section VII.
II. BACKGROUND
A. BASELINE GPU ARCHITECTURE
In this paper, we use NVIDIA GPU terminologies for con-
sistency [1], [2], [14]–[18]. Our baseline GPU model com-
prises several SMs, and each SM contains dozens of cores;
therefore, the GPU device contains hundreds to thousands of
computing cores. An SM can execute thousands of threads
at a time (e.g., 2,048 in recent GPUs). To provide sufficient memory bandwidth to the massive number of concurrently executing threads, GPUs employ large register files. Consequently, the register file occupies a large portion of the entire on-chip memory in GPUs.
Fig. 1 describes the baseline GPU and the register file
architecture used in this work. The on-chip memory systems
of GPUs have hierarchical structures with a register file, L1
data cache, scratchpad memory (or shared memory), and L2
shared cache (or last-level cache).

FIGURE 2. Energy consumption of the SRAM-based GPU register file (energy consumption ratio per benchmark, split into leakage, dynamic read, and dynamic write energy).

The register file holds instruction operands and provides additional memory space for context switching. The L1 data cache and the scratchpad
memory are located inside of each SM and provide private
cache memory space. In particular, the scratchpad memory
stores user-defined data for higher throughput. The memory space for the L1 data cache and the scratchpad memory is often exchangeable. Finally, the L2 shared cache is located
outside of the SMs and provides shared memory space for
all threads operating on a GPU. The L2 shared cache is
connected with an off-chip main memory system through
memory controllers.
B. GPU REGISTER FILE ARCHITECTURE
Conventional GPUs provide large SRAM-based register files.
As shown in Fig. 1, the register file in a single SM is
composed of multiple banks in order to provide parallel bank
access and higher bandwidth. The baseline GPU model used
in this study employs a 128 KB register file composed of 64
banks. Each register file bank contains 256 entries, and each
entry is 64 bits wide; hence, a single entry stores two 32-bit values. GPU hardware executes multiple threads in a group,
known as a warp, which comprises 32 threads that share
the same program counter and execute identical instructions
simultaneously. To access multiple registers concurrently, the
registers for 32 threads in the same warp are allocated to the
same entry location that exists across multiple banks [19]. For
instance, assuming the size of a register r0 is 4 bytes, r0 of
32 threads in warp0 can be stored in entry0 of register
banks from 0 to 15. Hence, multiple threads in a single warp
access tens of registers simultaneously to minimize clock
cycles required to access the register file. However, a large
amount of dynamic and static power is dissipated in the
register file, as a single warp instruction can activate dozens
of register file banks [19].
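To make the layout concrete, the sketch below maps a warp register to banks and entries under the organization above (64 banks, 256 entries per bank, 64-bit entries). The striding policy in warp_register_location() is an assumption for illustration; the actual mapping used by real GPUs is not public.

```python
# Minimal sketch of one plausible warp-register-to-bank mapping,
# assuming the 64-bank, 256-entry, 64-bit-entry organization above.
# The striding policy is illustrative, not the documented GPU hash.

NUM_BANKS = 64
BANKS_PER_WARP_REG = 16   # 32 threads x 4 B = 128 B, over 8 B entries

def warp_register_location(warp_id: int, reg_id: int):
    """Return (bank_ids, entry_index) for register reg_id of a warp."""
    first_bank = (warp_id * BANKS_PER_WARP_REG) % NUM_BANKS
    banks = [(first_bank + i) % NUM_BANKS for i in range(BANKS_PER_WARP_REG)]
    entry = reg_id % 256              # one entry row per register (assumed)
    return banks, entry

# Example from the text: r0 of warp0 lands in entry 0 of banks 0-15.
assert warp_register_location(0, 0) == (list(range(16)), 0)
```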
C. CHARACTERISTICS OF STT-MRAM
High leakage current in SRAM-based memory cells is the
major source of large power dissipation in the GPU register
file. As shown in Fig. 2, approximately 59.6% of the total
GPU register file energy is consumed by the leakage current
in the SRAM cells under 32 nm CMOS technology. De-
tailed simulation methodologies for the figure are provided
in Section V. Using the emerging NVM cells is a prominent
approach, as most NVM technologies exhibit extremely low
leakage current. For instance, STT-MRAM provides a similar
read latency with a smaller cell size and an extremely low
leakage current compared to SRAM-based memory cells.
However, NVM technologies still have drawbacks in terms of
write performance and endurance. Other NVM technologies
such as Phase Change Memory (PCM) and Resistive Ran-
dom Access Memory (ReRAM) exhibit drawbacks, such as
extremely low endurance or long access latency; therefore,
they are not suitable for use as fast on-chip memories [20]. In
this work, we use STT-MRAM to build an energy-efficient
register file for GPUs.
TABLE 1. Characteristics of SRAM and STT-MRAM [4], [5], [21]

Parameter                 SRAM     STT-MRAM
Cell factor (F^2)         146      57.5
Area (mm^2)               0.194    0.038
Read latency (cycles)     1        1
Write latency (cycles)    1        4
Read energy (pJ/bit)      0.203    0.239
Write energy (pJ/bit)     0.191    0.300
Leakage power (mW)        248.7    16.2
Endurance (writes)        10^16    10^13
We compare the characteristics of STT-MRAM with the
conventional SRAM cells in Table 1. The parameters are
measured using NVSim [4], [5], [21]. More detailed ex-
perimental methodologies are provided in Section V. The
endurance parameter of STT-MRAM is configured based on
the actual usage environment [22]. Even though STT-MRAM
exhibits the same read latency, the longer write access latency
(four times slower) and higher write energy consumption are
critical disadvantages of STT-MRAM cells. Another critical
drawback of STT-MRAM is its low write endurance, which is
1,000 times lower compared to that of SRAM. In particular,
the register file is the memory space closest to the computational units; therefore, data reads and writes occur there most frequently. Consequently, the low write endurance of
STT-MRAM results in the short lifetime of the entire GPU
system.
To investigate the effect of STT-MRAM-based register
files on GPU performance and lifetime, we configured the
STT-MRAM parameters listed in Table 1 into a cycle-
accurate GPU simulator [23]. In the study, we replaced the
SRAM-based register file in the GPU with an STT-MRAM-
based register file without any additional techniques. Our
simulation study reveals that the performance decreases by
18% due to slower register file write operations, and the GPU
lifetime becomes only 11 months due to the limited register
file write endurance. Both the performance drop and short
lifetime are impractical for GPUs.
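As a sanity check on the 11-month figure (our own back-of-the-envelope reading, not a calculation reported above), the lifetime implied by the 10^13-write endurance of Table 1 follows from an assumed per-cell write rate r_write:

```latex
\text{lifetime} \approx \frac{E_{\text{endurance}}}{r_{\text{write}}}
\quad\Longrightarrow\quad
r_{\text{write}} \approx \frac{10^{13}\ \text{writes}}{2.9\times10^{7}\ \text{s}\ (\approx 11\ \text{months})}
\approx 3.5\times10^{5}\ \text{writes/s}
```

That is, a sustained write rate on the order of a few hundred thousand writes per second to the most heavily written cell is enough to wear it out in under a year.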
FIGURE 3. On-chip memory size of NVIDIA GPUs by generation (register file, L1D + scratchpad, and L2 sizes, with the register file portion of the total, for Fermi GF100 through Ampere GA100).

FIGURE 4. On-chip memory size per shader unit (KB) for Fermi through Ampere (L1 data cache + scratchpad memory, L2 cache, and register file).
III. MOTIVATION
As introduced in the previous section, low write performance
and endurance are critical obstacles hindering the usage of
STT-MRAM cells as on-chip memory structures, such as
register files and L1 caches. In this section, we address the
high energy consumption problem of the GPU register file and
analyze the unique patterns of register file read and write
operations in GPU applications. In addition, we analyze the
similarities in the register data accessed by adjacent threads
within the same warp. Furthermore, we demonstrate that
the register file data can be effectively compressed to a
smaller size by exploiting the data similarity observed in the
register data. Such observations motivate the development
of architectural solutions that can employ STT-MRAM to
achieve an energy-efficient register file design.
A. ENERGY CONSUMPTION OF GPU REGISTER FILE
To achieve a high throughput, GPUs execute hundreds of
thousands of threads concurrently. For example, the recent
NVIDIA Ampere GA100 executes up to 221,184 threads
in one GPU device [2]. The large register file has been a key enabler for the high throughput of GPUs.
We analyzed the on-chip memory size and a portion of the
register file for NVIDIA GPUs over the last decade [1], [2],
[14]–[18]. As shown in Fig. 3, the register file constituted
approximately 60% of the total on-chip memory. Despite
SRAM’s high energy consumption, the register file occupies
a large portion of the on-chip memory system.

FIGURE 5. Read and write ratios of the Top-5 most accessed registers (share of all accesses per benchmark, for the five most written and five most read register IDs).
The relatively low register file portion of Ampere (approximately 30.1%) is due to a substantial increase in the L2 cache. The L2 caches of NVIDIA GPUs have grown to serve workloads whose datasets cannot be cached effectively at conventional L2 capacities, such as artificial intelligence and high-performance computing workloads.
The register file size of NVIDIA GPUs has increased
steadily. Fig. 4 shows the on-chip memory size per shader
unit (or CUDA core). Except for Ampere's unusually large L2 cache,
the register file size per shader unit is always the largest
among various on-chip memory structures. In fact, the regis-
ter file size per shader unit has remained at 4 KB since Pascal
GPUs. The number of computing cores in a GPU device will
continue to increase to achieve higher throughput. Hence, the
register file size is expected to increase accordingly. How-
ever, large SRAM-based register files consume a substantial
amount of static energy, as presented in Section II-C. As
a result, the energy consumption of register files in GPUs
becomes an increasingly important issue.
B. LOCALITY IN GPU REGISTER FILE
We discovered that the register file accesses possess locality
based on the register IDs. To reveal this register locality,
we measured the read and write access counts by register
IDs using a cycle-accurate GPU simulator [23]. Detailed
simulation methodology is provided in Section V. For in-
stance, if an instruction requires r0 and r1 as operands
and writes the execution result to r2, we increment the
read counts of r0 and r1 and the write count of r2 by
1, respectively. Subsequently, we sorted the register access
counts based on the register IDs. This measurement reveals
two important characteristics in GPU register file accesses.
First, most register file accesses are concentrated on several
register IDs. On average, more than 80% of register write
accesses are from the Top-5 most written register IDs as
shown in Fig. 5. Second, the write access counts from these
register IDs are similar to the read counts. This study reveals
that high locality and concentration are observed in GPU
register file accesses.
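The counting procedure can be summarized with a short sketch; the simplified (dst, srcs) instruction form and helper names are ours, not the simulator's API.

```python
from collections import Counter

# Sketch of the per-register-ID access counting described above.
read_counts, write_counts = Counter(), Counter()

def record(dst, srcs):
    """E.g., an add writing r2 from r0 and r1: record('r2', ['r0', 'r1'])."""
    for r in srcs:
        read_counts[r] += 1
    write_counts[dst] += 1

record('r2', ['r0', 'r1'])   # the example instruction from the text

def top5_write_share():
    """Share of writes going to the Top-5 most written register IDs;
    the measurement above reports more than 80% on average (Fig. 5)."""
    total = sum(write_counts.values())
    return sum(c for _, c in write_counts.most_common(5)) / total
```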
FIGURE 6. Hi-End architecture (the baseline pipeline of warp scheduler, operand collectors, bank arbiter, crossbar, and SIMD execution units, extended with the new Hi-End structures: register cache, delay buffer, compression and decompression units, a compression indicator with bank-level wear-leveling in the bank arbiter, and 64 STT-MRAM register banks of 256 x 64-bit entries).
C. DATA SIMILARITY IN GPU REGISTER FILE
The other characteristic observed in the GPU register file data is that the data written by adjacent threads within the same warp have similar values. Hence, this type of data
can be easily compressed using simple compression methods,
such as the Base-Delta-Immediate (BDI) algorithm [24]. The
BDI compression method can perform fast and simple com-
pression and decompression processes. A BDI compressor
extracts two parameters, called the base and the deltas, from a set of values. The data to be compressed are split into multiple pieces, and the first piece is set as the base. Subsequently, the difference from the base is computed for each of the other pieces, and these differences are stored as deltas. For
instance, 128-byte register data composed of 32 4-byte values can be compressed to 35 bytes when BDI compression is applied with a 4-byte base and a 1-byte delta resolution. Furthermore, when all 32 values are identical, the 128-byte register data can be compressed to 4 bytes, as the delta width is 0 bytes. Hence, by applying BDI compression, read/write accesses to multiple memory banks can be reduced.
The original BDI algorithm exploits multiple bases and
corresponding deltas to achieve optimal compression ratio.
However, in this study, we fixed the base resolution to 4-byte
to simplify the hardware design. Furthermore, we restricted
the deltas to only have 0-byte, 1-byte, and 2-byte resolutions.
By applying these limitations, we can minimize the clock cycles
and power consumption of the compression and decompres-
sion units. In this work, if the data cannot be compressed us-
ing such fixed base and delta resolutions, the data will remain
uncompressed. To demonstrate the value similarity of the GPU register file, we calculated the ratio of register writes that can be compressed using the BDI algorithm: in fact, more than 62% of register writes are compressible in this way. Furthermore, recent studies revealed that the
register file can be effectively compressed using the BDI
compression algorithm [4], [19].
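A minimal, runnable sketch of this restricted BDI variant follows (fixed 4-byte base; 0-, 1-, or 2-byte deltas). The function name and the byte-level packing are our assumptions, since the hardware operates on bit fields rather than Python byte strings.

```python
import struct

def bdi_compress(data: bytes):
    """BDI with a fixed 4-byte base and 0/1/2-byte deltas, as in the
    restricted configuration above. Returns (delta_bytes, payload),
    or None when the data must be stored uncompressed."""
    assert len(data) == 128                       # one 32-thread warp register
    words = struct.unpack('<32i', data)           # 32 signed 4-byte values
    base, deltas = words[0], [w - words[0] for w in words[1:]]

    if all(d == 0 for d in deltas):               # 0-byte delta: base only
        return 0, struct.pack('<i', base)         # 4 bytes total
    for nbytes, fmt in ((1, '<b'), (2, '<h')):
        bound = 1 << (8 * nbytes - 1)
        if all(-bound <= d < bound for d in deltas):
            payload = struct.pack('<i', base)
            payload += b''.join(struct.pack(fmt, d) for d in deltas)
            return nbytes, payload                # 4 + 31 * nbytes bytes
    return None                                   # incompressible

# 32 identical values -> 4 bytes; base plus small deltas -> 35 bytes.
assert len(bdi_compress(struct.pack('<32i', *[7] * 32))[1]) == 4
assert len(bdi_compress(struct.pack('<32i', *range(100, 132)))[1]) == 35
```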
IV. HIERARCHICAL, ENDURANCE-AWARE STT-MRAM
REGISTER FILE ARCHITECTURE (HI-END)
We propose an energy-efficient STT-MRAM-based register
file for GPUs. Our proposed register file architecture exploits
STT-MRAM as a primary memory unit for register data
to reduce the static power dissipation. To overcome STT-
MRAM’s poor write performance, we apply a hierarchical
structure that employs a small SRAM as a cache for the
STT-MRAM register file. It is effective because of the strong
locality observed in the GPU’s register file data, as men-
tioned in Section III-B. In addition, we propose an efficient
wear-leveling mechanism using register data compression
to extend the lifetime of STT-MRAM cells. The detailed
architecture will be described in the following subsections.
A. OVERALL ARCHITECTURE OF HI-END
In this section, we describe the detailed architecture of the
proposed STT-MRAM-based hierarchical, endurance-aware
register file, called Hi-End. The overall architecture of Hi-
End is depicted in Fig. 6. Hi-End includes additional com-
ponents (register cache, delay buffer, compression unit, and
bank-level wear-leveling) on the baseline GPU register file.
The newly added blocks are shown in gray in the figure. Hi-
End employs a hierarchical structure that exploits an SRAM-
based write cache to hide the long write latency of the STT-
MRAM register file and reduce write counts to the STT-
MRAM cells. The SRAM-based write cache works as a
gateway for register writes to the STT-MRAM register file.
Hence, for register writes, the register values are first written
to the register cache. The GPU’s register data exhibits strong
locality (see Section III-B); therefore, the frequently written
register data can remain for a long time in the register cache.
Once the cached data are evicted from the register cache, the
evicted data are compressed using the compression unit. As
the size of the register data is reduced by the compression
step, the register data can occupy fewer banks of the STT-
MRAM register file.

FIGURE 7. Read and write operations in Hi-End. (a) A read request searches the register cache, then the delay buffer on a miss, and finally the register file. (b) A write request allocates in the register cache; an evicted block passes through the delay buffer before being written to the register file.

Furthermore, Hi-End applies bank-level
wear-leveling (BWL) to increase the lifetime of the STT-
MRAM cells. BWL equally distributes the write requests
to multiple banks of the STT-MRAM register file. Hi-End
employs the delay buffer between the register cache and the
bank arbiter to compensate the additional cycles from the
compression unit and slow STT-MRAM write operation. We
will describe the detailed structures and mechanisms of Hi-
End components in the following sections.
Hierarchical register file: Because Hi-End exploits the
hierarchical register file, the requested register data can be
found in the various levels of the memory units (register
cache, delay buffer, and STT-MRAM register file). Fig. 7
summarizes the read and write operations in Hi-End.
The GPU utilizes the operand collector to read register
data from the unified register file. When a warp issues an
instruction that accesses register data, the warp scheduler
passes operand information required for register file accesses
(see Fig. 6). Each operand collector contains the operand
information, such as warp IDs, valid flags, register IDs, and
ready flags. Such operand information is sent to Hi-End to access register data from the hierarchical register file.
For a read operation, Hi-End first searches the requested
data in the register cache. Since the register cache is a fast SRAM-based structure, Hi-End can quickly service the register data to the operand collector if
the requested data is found in the register cache. Second, the
requested data can be found in the delay buffer if the data is
evicted from the register cache and not written completely
to the STT-MRAM register file. When the requested data
does not exist in both the register cache and the delay buffer,
the data is read from the STT-MRAM register file. In this
case, if the data is stored in compressed form, the bank arbiter activates and accesses a smaller number of banks. The operand collector then obtains the original data via the decompression unit. Due to the hierarchical structure, the
register read latency can degrade the GPU performance when
the target data is in the STT-MRAM register file. However,
as most read requests are handled in the register cache, the
read delay is negligible in Hi-End. A detailed analysis of the effect of read latency on GPU performance is presented in Section V-C.

FIGURE 8. Register cache structure (direct-mapped, 256 blocks; an 11-bit tag is formed from the 6-bit warp ID and the 5-bit register ID with a shifter and an adder, and the index is the tag value modulo 256).
Once the Single Instruction Multiple Data (SIMD) exe-
cution units complete arithmetic operations, a register write
request occurs. Hi-End first searches the register cache for the
write request, and when a cache hit occurs, the target cache
line is updated with the new value. Otherwise, the write-
back data is newly allocated in the register cache. The data
evicted from the register cache must be stored in the STT-
MRAM register file if the register cache is already full with
the data that was previously written back; hence, the evicted
data enters the delay buffer. Subsequently, the evicted data
is compressed by the compression unit and written to the
STT-MRAM register file. This mechanism is similar to the
conventional cache write-back policy with a store buffer, except that all evicted data are written to the slower STT-MRAM register file. Once the data is compressed, the
bank arbiter selects the banks to be activated for register write
operation. Hi-End applies the BWL technique in this step
to improve endurance of STT-MRAM cells. The long write
latency of STT-MRAM can cause read-after-write hazards
and pipeline stalls in SMs. In Hi-End, with the support of
the register cache and delay buffer, a register write request
instantly obtains a cache line in the register cache; hence, the
write delay is negligible.
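The read and write paths can be summarized in a short sketch. The dictionary-based structures and the eviction stub below are illustrative stand-ins: the real register cache is direct-mapped, and compression and cycle timing are elided.

```python
# Runnable sketch of the Hi-End read/write paths (Fig. 7). Plain
# dicts stand in for the three structures; the real register cache
# is direct-mapped, so the capacity-based eviction here is a stub.
register_cache, delay_buffer, sttmram_rf = {}, {}, {}
RC_BLOCKS = 256

def hiend_read(key):
    """key = (warp_id, reg_id); the search order mirrors Fig. 7(a)."""
    if key in register_cache:        # fast SRAM hit
        return register_cache[key]
    if key in delay_buffer:          # evicted block not yet written back
        return delay_buffer[key]
    return sttmram_rf.get(key)       # decompressed on the way out

def hiend_write(key, data):
    """All writes allocate in the register cache (Fig. 7(b)); a victim
    evicted from a full cache enters the delay buffer and is later
    compressed and written to the STT-MRAM register file."""
    if key not in register_cache and len(register_cache) >= RC_BLOCKS:
        victim, vdata = register_cache.popitem()   # stub eviction policy
        delay_buffer[victim] = vdata               # then compress + write
    register_cache[key] = data
```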
B. DETAILED ARCHITECTURE OF HI-END COMPONENT
In this subsection, we discuss the detailed architecture and
operations of each structure in Hi-End, such as the register
cache, delay buffer, compression unit, and bank-level wear-
leveling.
Register cache: The register cache is an SRAM-based,
fast, and temporary cache memory for write-back data. The
register cache employs a direct-mapped cache structure to
simplify the hardware complexity. While the write perfor-
mance of STT-MRAM is much slower than that of SRAM,
the read performance of STT-MRAM is similar to that of
SRAM. Using this characteristic, the register cache in Hi-End
only allocates write data, unlike the general cache memory.
A read request can be quickly serviced from the STT-MRAM
FIGURE 9. Delay buffer structure.
register file due to its read performance. Therefore, the read
request does not require the support of the register cache.
Furthermore, requesting the previously written-back register
data is more probable than reusing the data previously read
from the register file. Hence, the register cache effectively
increases the write hit ratio by allocating only write data.
Using this approach, Hi-End can reduce the number of write
requests to the STT-MRAM register file, thereby increasing
the lifetime of the STT-MRAM cells. In this work, the
register cache contains 256 blocks with 1024-bit cache lines. As shown in Fig. 8, a single cache line includes a 1-bit valid flag and an 11-bit tag field. Unlike conventional cache tags that use part of an address field, the tags in the register cache combine a 6-bit warp ID and a 5-bit register ID. The tag calculation logic is implemented with a logical shifter and an adder.
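A sketch of the tag and index calculation of Fig. 8 follows. Storing the full 11-bit combined value as the tag follows the figure, although an implementation could store only the bits not covered by the index.

```python
def register_cache_slot(warp_id: int, reg_id: int):
    """Tag calculation sketched in Fig. 8: an 11-bit key is formed
    from the 6-bit warp ID and the 5-bit register ID with a shifter
    and an adder; the index is the key modulo 256 (direct-mapped)."""
    assert warp_id < 64 and reg_id < 32
    key = (warp_id << 5) + reg_id        # 11-bit combined identifier
    index = key % 256                    # selects one of 256 blocks
    tag = key                            # stored and compared on lookup
    return index, tag

# warp 3, r7 -> key (3 << 5) + 7 = 103 -> index 103, tag 103
assert register_cache_slot(3, 7) == (103, 103)
```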
Delay buffer: The role of the delay buffer is similar to
that of the store buffer in the conventional cache architecture.
The delay buffer temporarily stores the data evicted from the register cache before they are written to the STT-MRAM register file. This buffer space is required because several cycles are needed for data compression and for the write operation to the STT-MRAM cells. Note that
the write performance of STT-MRAM used in this work is
four times slower than that of SRAM cells, as explained in
Section II-C. In this work, we configure the delay buffer
to include 16 entries, as shown in Fig. 9. Our simulation
study reveals that this depth is sufficient to minimize the
structural hazards in the delay buffer considering the cycles
required by the compression unit and the STT-MRAM writes.
As mentioned previously, read requests can be served from the delay buffer. This feature allows Hi-End to read
the data while they are being compressed, thereby hiding
compression latency. In addition, if the delay buffer did not support read operations, the operand collector could obtain incorrect data from the STT-MRAM register file due to the long write delay. This is because, between the time data leaves the register cache and the time it is written to the STT-MRAM register file, no corresponding entry exists in the register cache, and only stale data exists in the STT-MRAM register file.
FIGURE 10. Implementation of the compression unit [19] (32-bit subtractors compute per-word deltas from the 4-byte base; sign-extension comparators test compressibility; compressible data is packed, otherwise the original data is emitted uncompressed).

To support the read operation of the delay buffer, each entry of the delay buffer has additional control fields, as shown
in Fig. 9. These control fields include a 1-bit valid flag, 1-bit
allocate flag, 3-bit counter, 5-bit register ID, 6-bit warp ID,
and 1024-bit register data. While other fields contain general
information, the allocate flag and the counter contain unique
information for STT-MRAM. These two fields are used to
validate the completion of write operations. If a polling-based approach were used for this validation, every entry of the delay buffer and every STT-MRAM register bank would have to communicate with each other, and the completion of each write process would have to be verified every cycle. Hi-End employs a counter-based approach to avoid such overhead. The
allocate flag indicates whether the corresponding entry is
currently writing data into the STT-MRAM register file. The
counter field contains the number of cycles remaining to
complete the write process of the corresponding entry. The
counter is set to 6 when a write process is initiated and
decreased by 1 while the allocate flag is 1. In detail, writing
data to the STT-MRAM register file requires two cycles for
compression and four cycles of STT-MRAM write latency;
hence, six cycles are required in total, and the counter must
be set accordingly.
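A sketch of this counter-based completion tracking: the entry is freed six cycles after the write is initiated (two cycles of compression plus four cycles of STT-MRAM write latency). The class shape is ours; the field widths follow Fig. 9.

```python
COMPRESS_CYCLES, STT_WRITE_CYCLES = 2, 4   # from the text above

class DelayBufferEntry:
    """Counter-based write-completion tracking, a sketch of the
    mechanism described above; field widths follow Fig. 9."""
    def __init__(self, warp_id, reg_id, data):
        self.valid, self.allocate = 1, 0   # 1-bit flags
        self.warp_id, self.reg_id, self.data = warp_id, reg_id, data
        self.counter = 0                   # 3-bit down counter

    def start_write(self):
        self.allocate = 1
        self.counter = COMPRESS_CYCLES + STT_WRITE_CYCLES  # set to 6

    def tick(self):
        """Called once per cycle; reclaims the entry when the write
        to the STT-MRAM register file must have completed."""
        if self.allocate:
            self.counter -= 1
            if self.counter == 0:
                self.valid = self.allocate = 0   # entry freed
```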
Compression unit: We implement a compression unit
using the BDI algorithm used in warped-compression [19].
In the BDI algorithm, the compression ratio can be improved
by supporting various configurations for the base and delta.
However, a higher number of options for the base and delta
results in a higher hardware overhead; hence, we limit the
number of possible sizes of the base and delta. The size of the base is fixed at 4 bytes, whereas the size of the delta can be 0, 1, or 2 bytes. Based on our observations shown in
Section III, most GPU register file data can be efficiently
compressed using these three options for delta. Therefore, the
compression unit in Hi-End supports only three compression
options to simplify the hardware design. The hardware ar-
chitecture of the compression unit is described in Fig. 10.
FIGURE 11. Example of the bank-level wear-leveling operation in Hi-End compared to the baseline register file and warped-compression [19] (four warp register writes: (a) the baseline spreads each write across many banks; (b) warped-compression concentrates compressed writes on a few banks and power-gates the rest; (c) Hi-End rotates compressed writes across the banks).
Hi-End can reduce the dynamic power for accessing the
STT-MRAM register file by compressing the data size. In
addition, we can reduce the number of activated banks in the
STT-MRAM register file when storing compressed data. For
instance, the compressed data will activate only 5 banks if
the data size is reduced by a 4-byte base and a 1-byte delta,
whereas the uncompressed 128-byte data accesses 16 banks
in the register file. Hence, Hi-End can effectively reduce the
access counts to the STT-MRAM cells and, with the support of BWL, prolong the lifetime of the register file.
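The bank count follows directly from the 64-bit (8-byte) bank width; for the example above:

```latex
N_{\text{banks}} = \left\lceil \frac{\text{data size}}{8\ \text{B}} \right\rceil,
\qquad
\left\lceil \tfrac{35}{8} \right\rceil = 5 \ \text{(compressed)}
\quad\text{vs.}\quad
\left\lceil \tfrac{128}{8} \right\rceil = 16 \ \text{(uncompressed)}.
```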
Bank-level wear-leveling: To prolong the lifetime of
the STT-MRAM cells, Hi-End applies BWL by exploiting
the advantages of the compressed data. Hi-End can reduce
the overall write counts for the STT-MRAM banks using
a hierarchical structure. However, some register banks may
have a shorter lifetime if the compressed data is written more
frequently to several specific banks. To avoid such cases, Hi-
End applies the wear-leveling mechanism, which can balance
the write accesses among multiple register file banks.
Fig. 11 presents an illustrative example of BWL of Hi-End
compared to the baseline register file without compression
and warped-compression [19]. Without using the compres-
sion technique, each register data is written across multiple
banks, as shown in Fig. 11(a). In the baseline model, all
banks are affected by the high write count, which results in
a short lifetime when the register file is implemented with
STT-MRAM. Warped-compression compresses register data,
uses a smaller number of banks, and power-gates the unused banks
to reduce leakage power, as shown in Fig. 11(b). However,
in case of STT-MRAM, power gating has a less prominent
effect compared to the case of SRAM because the leakage
power of STT-MRAM is extremely small. Furthermore, al-
though the lifetime of the unused banks can be prolonged, the
banks that frequently store compressed data are still affected by high write counts and short lifetimes. The lifetime of a
device is determined by the structure with the shortest life-
time. Consequently, the lifetime of the STT-MRAM-based
register file with warped-compression is equivalent to that
of the baseline register file. As shown in Fig. 11(c), Hi-
End compresses register data and stores them into multiple
banks to avoid the write concentration that occurs in warped-compression. Using the wear-leveling technique, Hi-End can decrease the write concentration and improve the lifetime of the GPU device.

TABLE 2. Simulation Parameters

GPU Microarchitectural Parameters
Process: 32 nm
Core clock frequency: 700 MHz
SMs/GPU: 15
Warp size: 32 threads
Warp scheduling policy: Greedy-Then-Oldest
Register file size: 128 KB
Number of register banks/SM: 64
Maximum number of warps/SM: 48
Maximum number of threads/SM: 1,536
Maximum number of registers/SM: 32,768
Bit width/bank: 64-bit
Number of entries/bank: 256

Hi-End Architecture Parameters
Number of blocks in register cache: 256
Register cache size: 32.375 KB
Number of entries in delay buffer: 16
Delay buffer size: 2.03 KB
Compression unit energy/activation: 23 pJ
Compression unit leakage power: 0.12 mW
Decompression unit energy/activation: 21 pJ
Decompression unit leakage power: 0.08 mW
The Compression Indicator Table (CIT) in the bank arbiter
contains a compression range that indicates the compression
method used in the previous compression operation. The ar-
biter determines the number of banks to access by referring to
the compression range because the number of banks to access
changes based on the compression method. Furthermore, the
CIT contains the Bank Point Value (BPV), which indicates
the next bank ID of previously compressed data accessed. In
the write bank allocation process, the bank arbiter sends a
BPV to the compression unit with the data. The compression
unit identifies the bank ID to start a write operation using the
BPV. While the compression unit writes data to the register
file, it concurrently updates the CIT with a new BPV and
compression range. In the read operation, the decompression
unit can identify the number of banks to read by referring
to the compression range, and it can identify the bank ID at which to start the access by subtracting the number of banks to access from the BPV.
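One plausible reading of this bookkeeping as a runnable sketch; the exact CIT layout is not specified above, so the per-register (start bank, bank count) pair below is our assumption.

```python
NUM_BANKS = 64

class BankArbiterCIT:
    """Sketch of the Compression Indicator Table (CIT) and Bank
    Point Value (BPV) bookkeeping described above; storing a
    (start bank, bank count) pair per register is one plausible
    reading of the paper's description."""
    def __init__(self):
        self.bpv = 0          # next bank ID at which to start a write
        self.cit = {}         # (warp_id, reg_id) -> (start, n_banks)

    def allocate_write(self, key, n_banks):
        """Rotate compressed writes across banks starting at the BPV."""
        start = self.bpv
        self.cit[key] = (start, n_banks)           # compression range
        self.bpv = (start + n_banks) % NUM_BANKS   # advance the pointer
        return [(start + i) % NUM_BANKS for i in range(n_banks)]

    def locate_read(self, key):
        """Recover the banks to activate for a read from the stored
        start bank and compression range."""
        start, n = self.cit[key]
        return [(start + i) % NUM_BANKS for i in range(n)]

arb = BankArbiterCIT()
assert arb.allocate_write(('w0', 'r0'), 5) == [0, 1, 2, 3, 4]
assert arb.allocate_write(('w0', 'r1'), 5) == [5, 6, 7, 8, 9]
assert arb.locate_read(('w0', 'r0')) == [0, 1, 2, 3, 4]
```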
V. EVALUATION
A. METHODOLOGY
We evaluated our proposed architecture using the cycle-accurate simulator GPGPU-Sim, version 3.2.2 [23]. The detailed GPU hardware parameters are shown in Table 2. The parameters of the STT-MRAM and SRAM register files shown in Table 1 are evaluated using the circuit-level simulator NVSim and configured from previous studies [4], [21]. The parameters of the compression/decompression unit are configured from a previous study [19].
The detailed parameters of Hi-End, such as the register cache, delay buffer, and compression/decompression unit, are shown in Table 2.

FIGURE 12. Normalized register file energy consumption (normalized to the SRAM baseline; SRAM, STT, STT+WB+BDI, and Hi-End per benchmark).

FIGURE 13. Normalized GPU IPC (normalized to the SRAM baseline; SRAM, STT, STT+WB+BDI, and Hi-End per benchmark).
We selected 17 benchmarks from Parboil [25], CUDA
Software Development Kit [26], Rodinia [27], Poly-
Bench [28], and GPGPU-Sim benchmark suite [23]. Using
the benchmarks and architectural configuration above, we
compared the following four different architectures:
• SRAM: GPU architecture with an SRAM register file.
• STT: GPU architecture with an STT-MRAM register file without additional techniques.
• STT+WB(Write Buffer)+BDI: an STT-MRAM-based register file that includes a centralized write buffer and BDI compression from the previous study [4].
• Hi-End: the proposed architecture, which includes a register cache, a delay buffer, and a compression/decompression unit with the BWL technique.
B. ENERGY EFFICIENCY
The normalized energy consumption of the register file is
shown in Fig. 12. Each value is normalized to the energy consumption of the baseline SRAM-based register file. Using STT-MRAM directly as the register file instead of SRAM reduces the register file energy consumption by 43.98%, but it causes a 17.36% degradation in the GPU
Instructions per Cycle (IPC), as shown in Fig. 13. In several benchmarks, such as 3DCONV, BFS, GS, LBM, and ND, STT shows relatively low energy consumption compared to other benchmarks. As presented in Fig. 2 in Section II, the SRAM-based register file energy in such benchmarks is largely consumed by SRAM leakage. In particular, the leakage energy in BFS and GS is over 80% of the total register file energy consumption. High leakage energy consumption can be effectively decreased by directly using STT-MRAM without additional techniques. Furthermore, as the leakage energy ratio increases, the read and write energy ratio decreases, which diminishes the benefit of the additional techniques in Hi-End. Consequently, STT shows similar or even the lowest energy consumption on such benchmarks.
In the previous work (STT+WB+BDI), both the write
buffer and the STT-MRAM register file handle read requests
simultaneously, thereby requiring additional dynamic energy.
Hi-End accesses the STT-MRAM register file only if the
read request cannot obtain the data in the upper hierarchy,
such as the register cache or the delay buffer; hence, the
process is more energy-efficient. Overall, Hi-End operates the register file in an energy-efficient manner owing to the following two factors. The first involves the lower leakage current of
STT-MRAM.

FIGURE 14. Read access ratio of Hi-End memory structures (register cache, STT-MRAM register file, and delay buffer, per benchmark).

Since STT-MRAM is an NVM and does not
require a constant power supply, it exhibits almost no leakage current compared to SRAM. The second involves the
usage of the compression unit. STT-MRAM consumes high
dynamic power for read and write operations, as presented
in Table 1. Meanwhile, Hi-End reduces the number of bits
for read and write operations by compressing data for STT-
MRAM, thereby reducing the dynamic power. As a result, Hi-End reduces the register file energy by 70.02%, whereas STT+WB+BDI reduces it by only 58.79% compared to the SRAM-based register file.
C. IPC PERFORMANCE
Fig. 13 shows the GPU IPC performance normalized to
the baseline SRAM-based register file. Directly using STT-MRAM as the register file instead of SRAM decreases the performance by 17.36% due to the long write
latency of STT-MRAM. In particular, BS, HW, MUM, and SC are affected more under STT in terms of both energy and performance. As presented in Fig. 2 in Section II, these applications spend a large portion of their register file energy on reads and writes, which means their numbers of read and write requests are larger than in other benchmarks. As the number of reads and writes increases, STT cannot fully absorb the in-flight requests due to its low write performance; hence, performance degrades. The previous work that adopts a WB
covers some of the problems and improves the IPC compared
to STT. However, merely using WB cannot fully hide the
long write latency of STT-MRAM; therefore, a performance
degradation of 8.12% is observed.
Fig. 14 shows the ratio of read access to the different
memory structures in Hi-End, such as the register cache,
STT-MRAM, and delay buffer. In Hi-End, 85.40% of the
read operations are served from the register cache, 14.45%
from the STT-MRAM-based register file, and 0.15% from
the delay buffer. Despite its low read access ratio, the delay
buffer is an essential component, as it prevents incorrect
data reads, as explained in Section IV-B.

FIGURE 15. Normalized access count of the most accessed register bank (STT-MRAM, Hi-End without BWL, and Hi-End, per benchmark; normalized to STT-MRAM).
As reads from the delay buffer and from the STT-MRAM with the decompression unit require two and four cycles, respectively, the average read latency is 1.43 cycles. By employing the register cache, Hi-End has a one-cycle write latency, as all write accesses are performed in the register cache. As a result, the overall increase is only 0.43 cycles of read latency, which can be hidden by warp scheduling or other short-latency-hiding techniques. Consequently, the evaluation results demonstrate that Hi-End achieves 99.14% of the performance of the SRAM-based register file.
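For reference, the 1.43-cycle figure is the access-ratio-weighted average, assuming a one-cycle register cache hit:

```latex
\bar{t}_{\text{read}} = 0.8540 \times 1 + 0.0015 \times 2 + 0.1445 \times 4
                     \approx 1.43\ \text{cycles}.
```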
D. ENDURANCE
Although the endurance of STT-MRAM is one of the most critical issues in using it, endurance was overlooked in previous studies [4]–[6]. In Hi-End, we improve the endurance of STT-MRAM using two architectural techniques: the register cache and BWL. Fig. 15 shows the
access count of the most accessed register bank normalized
to the directly used STT-MRAM register file. By absorbing writes in the register cache, Hi-End decreases write accesses to the STT-MRAM register file by 90.25% compared to using the STT-MRAM register file directly. The concentrated bank accesses, however, provide additional opportunities to improve the endurance of STT-MRAM. Using the BWL technique, Hi-End distributes the concentrated write accesses of compressed data, decreasing accesses to the most written bank by an additional 58.78% compared to Hi-End without BWL. As a result, Hi-End decreases write accesses to the most written bank by 95.98% compared to directly using the STT-MRAM register file, thereby enhancing endurance considerably.
E. AREA ANALYSIS
To analyze the area of Hi-End, we used NVSim for the
area simulation of the STT-MRAM-based register file,
register cache, and delay buffer [4], [21]. The compression/decompression units are evaluated based on a previous study [19]. Table 3 shows the relative area of Hi-End by structure compared to the baseline SRAM-based register file. Hi-End reduces the area by 8.19% compared to the
baseline SRAM-based register file despite the additional structures in Hi-End. The main reason for the area saving is the higher density of STT-MRAM compared to SRAM: a 128 KB STT-MRAM register file occupies only 19.59% of the area of a 128 KB SRAM register file. The register cache, delay buffer, compression/decompression units, and STT-MRAM register file occupy 30.55%, 5.59%, 36.08%, and 19.59% of the SRAM-based register file area, respectively, resulting in an 8.19% area reduction overall.

TABLE 3. Relative Area of Hi-End Compared to the SRAM-based Register File

Hi-End Component                  Area Ratio (%)
Compression/decompression unit    36.08
Register cache                    30.55
STT-MRAM register file            19.59
Delay buffer                      5.59
Area saving                       8.19
Total                             100
VI. RELATED WORK
Hi-End aims to provide higher energy efficiency, lower per-
formance degradation, and higher endurance for GPUs with
STT-MRAM-based register file. To the best of our knowl-
edge, this work is the first that investigated compositive
optimization techniques of register file cache, compression,
and wear-leveling for STT-MRAM-based register file. In this
section, we discuss previous techniques that are related to our
study, such as optimization techniques for GPUs with STT-
MRAM register file and GPU register files with compression
techniques.
Register file adopting STT-MRAM: Recent studies attempted to reduce the power consumption of the GPU register file by substituting STT-MRAM for SRAM. However, directly using STT-MRAM in a GPU register file instead of SRAM is challenging due to its drawbacks. Several previous works
attempted to solve the challenges of STT-MRAM with archi-
tectural approaches.
Li et al. proposed a hybrid STT-MRAM register file archi-
tecture [5]. They adopted bank-level write buffer and warp-
aware write back techniques to overcome the performance
drop of STT-MRAM-based register files. They employed two
SRAM buffers for every STT-MRAM register file. When
a warp is active, one SRAM buffer is used to write back
registers, and the next active warp uses the other SRAM
buffer. The two buffers provide services alternately between
the two warps, thereby hiding the long latency stall of STT-
MRAM. However, in this approach, the entire pipeline of the
GPU can be stalled when the active period of one warp is
not adequate to write back data from the SRAM write buffer
to the STT-MRAM register file. Furthermore, such per bank
write buffer designs cannot fully utilize SRAM resources
when the write requests are unbalanced.
To address the limitations of the previous approach [5],
Zhang et al. proposed centralized write buffer [4]. They
employed an SRAM buffer in the bank arbiter, and the buffer
is shared by all register banks. Furthermore, they adopted
the compression technique to decrease the dynamic power
consumption of STT-MRAM. However, in this work, the
centralized write buffer and the STT-MRAM register file
are concurrently accessed to minimize read latency. This
approach decreases the energy efficiency of the GPU register
file due to the redundant memory accesses. We compared the
effect of the centralized write buffer on the energy consump-
tion and performance with Hi-End in Section V. In addition,
unlike Hi-End, the endurance problem of STT-MRAM is not
considered in the previous studies.
Liu et al. proposed a Multi-Level Cell (MLC) STT-MRAM
register file for GPUs [6]. Using the MLC STT-MRAM
design, they achieve a larger storage density of register
file. Furthermore, they proposed an MLC-aware register file
remapping strategy and a warp rescheduling scheme to accommodate the different read and write performance of soft-bit and hard-bit operations in MLC STT-MRAM. That work focused on optimization techniques for MLC STT-MRAM, whereas our work uses a hierarchical approach and a wear-leveling technique. Consequently, their MLC-aware technique is orthogonal to Hi-End, and the two can be combined.
GPU register file with compression: Previous works utilized several compression techniques to minimize the static or dynamic power consumption of SRAM-based register files in GPUs.
Lee et al. proposed warped-compression, an SRAM-based GPU register file with the Base-Delta-Immediate (BDI) compression algorithm [19], [29]. They restricted the base and delta parameters and applied BDI compression directly to warp registers in an SRAM-based GPU register file. After values are compressed, a smaller number of register banks is used, and the unused register banks are power-gated; hence, the technique reduces the power consumption of GPUs. In addition to adopting the compression technique in an STT-MRAM-based GPU register file, we propose additional optimization techniques specific to STT-MRAM. Unlike SRAM, which benefits only from power saving, compression in STT-MRAM additionally improves endurance with the support of BWL in Hi-End. Moreover, the power saving from compression is more prominent in STT-MRAM than in SRAM, since STT-MRAM requires higher dynamic power.
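The core of the base-delta idea can be shown in a few lines. The sketch below compresses one warp register in the spirit of warped-compression [19], [29]; the full BDI scheme [24] supports multiple base and delta sizes, whereas the single base and 1-byte deltas here are simplifying assumptions.

def compress_warp_register(lanes, delta_bytes=1):
    # lanes: the 32 per-thread 32-bit values of one warp register.
    base = lanes[0]
    deltas = [v - base for v in lanes]
    limit = 1 << (8 * delta_bytes - 1)
    if all(-limit <= d < limit for d in deltas):
        # 4 bytes for the base plus one small delta per lane,
        # instead of 4 * 32 = 128 bytes uncompressed.
        return {"base": base, "deltas": deltas,
                "bytes": 4 + delta_bytes * len(lanes)}
    return None  # incompressible: stored uncompressed

def decompress_warp_register(packed):
    return [packed["base"] + d for d in packed["deltas"]]

# Consecutive per-thread addresses compress from 128 bytes to 36 bytes.
lanes = [0x1000 + 4 * i for i in range(32)]
packed = compress_warp_register(lanes)
assert packed is not None and decompress_warp_register(packed) == lanes

When fewer bytes are written, fewer banks are activated, which is the source of both the dynamic power saving and, in the STT-MRAM case, the reduced cell wear.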
Wong et al. proposed warp approximation, which stores a representative value for a warp using an approximation technique [30]. They demonstrated effects similar to those of warped-compression and reduced the number of bits read or written, thereby reducing the dynamic and static power of the SRAM-based register file. Their approach can further increase the compression ratio compared to warped-compression, as values that differ only slightly from the representative value can additionally be removed. Since the approximation-based compression technique is orthogonal to Hi-End, it can also be applied to our work.
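The representative-value idea reduces to a simple similarity test per warp. The sketch below is illustrative only, not Wong et al.'s mechanism [30]; the choice of lane 0 as the representative and the 1% threshold are assumptions.

def approximate_warp_register(lanes, max_rel_error=0.01):
    # Return a single representative if all lanes are close to it.
    rep = lanes[0]
    tolerance = abs(rep) * max_rel_error
    if all(abs(v - rep) <= tolerance for v in lanes):
        return rep      # one value replaces all 32 lane values
    return None         # lanes too dissimilar: keep them exact

# Values within 1% of the representative collapse to one stored value.
assert approximate_warp_register([100.0, 100.4, 99.7] + [100.0] * 29) == 100.0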
VII. CONCLUSION
With the scaling down of CMOS and the adoption of larger register files in the latest GPU devices, the leakage current of conventional SRAM-based register files has become a critical problem. Recent studies proposed using STT-MRAM as a register file because of its low leakage current and acceptable read performance. However, STT-MRAM-based GPU register files must be carefully designed due to write overheads and endurance problems. Various existing techniques have addressed these challenges, but the write latency and endurance issues were not completely resolved. Hence, to address these issues, we propose the Hi-End register file architecture for GPUs. In Hi-End, we adopt a register cache to exploit the locality in the register file and a delay buffer to cover evicted data. Moreover, we implement a compression unit to reduce the high dynamic power and employ BWL to address the concentrated accesses of compressed data. Our experimental results demonstrate that Hi-End reduces energy consumption by 70.02% with only a 0.86% performance degradation compared to an SRAM-based register file. Moreover, Hi-End removes 95.98% of the write accesses to the register banks by distributing concentrated bank accesses and reducing write accesses with the register cache.
ACKNOWLEDGMENT
Won Jeon and Jun Hyun Park contributed equally to this
work.
REFERENCES
[1] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture:
Fermi, 2009.
[2] ——, NVIDIA A100 Tensor Core GPU Architecture: Unprecedented Acceleration at Every Scale, 2020.
[3] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.
Aamodt, and V. J. Reddi, “Gpuwattch: enabling energy optimizations in
gpgpus,” in ACM SIGARCH Computer Architecture News, vol. 41, no. 3.
ACM, 2013, pp. 487–498.
[4] H. Zhang, X. Chen, N. Xiao, and F. Liu, “Architecting energy-efficient stt-
ram based register file on gpgpus via delta compression,” in Proceedings
of the 53rd Annual Design Automation Conference. ACM, 2016, p. 119.
[5] G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang,
“A stt-ram-based low-power hybrid register file for gpgpus,” in 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2015,
pp. 1–6.
[6] X. Liu, M. Mao, X. Bi, H. Li, and Y. Chen, “An efficient stt-ram-based
register file in gpu architectures,” in The 20th Asia and South Pacific
Design Automation Conference. IEEE, 2015, pp. 490–495.
[7] S. Mittal, “A survey of techniques for architecting and managing gpu
register file,” IEEE Transactions on Parallel and Distributed Systems,
vol. 28, no. 1, pp. 16–28, 2016.
[8] H. Zhang, X. Chen, N. Xiao, L. Wang, F. Liu, W. Chen, and Z. Chen,
“Shielding stt-ram based register files on gpus against read disturbance,”
ACM Journal on Emerging Technologies in Computing Systems (JETC),
vol. 13, no. 2, pp. 1–17, 2016.
[9] Z. Gong, K. Qiu, W. Chen, Y. Ni, Y. Xu, and J. Yang, “Redesigning
pipeline when architecting stt-ram as registers in rad-hard environment,” Sustainable Computing: Informatics and Systems, vol. 22, pp. 206–218,
2019.
[10] Q. Deng, Y. Zhang, M. Zhang, and J. Yang, “Towards warp-scheduler
friendly stt-ram/sram hybrid gpgpu register file design,” in 2017
IEEE/ACM International Conference on Computer-Aided Design (IC-
CAD). IEEE, 2017, pp. 736–742.
[11] Y. Ni, Z. Gong, W. Chen, C. Yang, and K. Qiu, “State-transition-aware
spilling heuristic for mlc stt-ram-based registers,” VLSI Design, vol. 2017,
2017.
[12] X. Guo, E. Ipek, and T. Soyata, “Resistive computation: avoiding
the power wall with low-leakage, stt-mram based computing,” ACM
SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 371–382, 2010.
[13] J. Zhang, M. Jung, and M. Kandemir, “Fuse: Fusing stt-mram into gpus to
alleviate off-chip memory access overheads,” in 2019 IEEE International
Symposium on High Performance Computer Architecture (HPCA). IEEE,
2019, pp. 426–439.
[14] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Ke-
pler GK110, 2012.
[15] ——, NVIDIA GeForce GTX 980: Featuring Maxwell, The Most Advanced
GPU Ever Made, 2014.
[16] ——, NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator
Ever Built, 2016.
[17] ——, NVIDIA Tesla V100 GPU Architecture: The World’s Most Advanced Data Center GPU, 2017.
[18] ——, NVIDIA Turing GPU Architecture: Graphics Reinvented, 2018.
[19] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, “Warped-
compression: enabling power efficient gpus through register compression,”
in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM,
2015, pp. 502–514.
[20] S. Mittal, J. S. Vetter, and D. Li, “A survey of architectural approaches
for managing embedded dram and non-volatile on-chip caches,” IEEE
Transactions on Parallel and Distributed Systems, vol. 26, no. 6, pp. 1524–
1537, 2014.
[21] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level
performance, energy, and area model for emerging nonvolatile memory,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 31, no. 7, pp. 994–1007, 2012.
[22] S. Yazdanshenas, M. R. Pirbasti, M. Fazeli, and A. Patooghy, “Coding last
level stt-ram cache for high endurance and low power,” IEEE computer
architecture letters, vol. 13, no. 2, pp. 73–76, 2013.
[23] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt,
“Analyzing cuda workloads using a detailed gpu simulator,” in 2009
IEEE International Symposium on Performance Analysis of Systems and
Software. IEEE, 2009, pp. 163–174.
[24] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Base-delta-immediate compression: practical data
compression for on-chip caches,” in Proceedings of the 21st international
conference on Parallel architectures and compilation techniques. ACM,
2012, pp. 377–388.
[25] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari,
G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark suite for
scientific and commercial throughput computing,” Center for Reliable and
High-Performance Computing, vol. 127, 2012.
[26] NVIDIA, NVIDIA CUDA SDK Code Sample 4.0, March 2016.
[27] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron,
“A characterization of the rodinia benchmark suite with comparison to
contemporary cmp workloads,” in IEEE International Symposium on
Workload Characterization (IISWC’10). IEEE, 2010, pp. 1–11.
[28] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” URL: http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[29] S. Lee, K. Kim, G. Koo, H. Jeon, M. Annavaram, and W. W. Ro, “Improv-
ing energy efficiency of gpus through data compression and compressed
execution,” IEEE Transactions on Computers, vol. 66, no. 5, pp. 834–847,
2016.
[30] D. Wong, N. S. Kim, and M. Annavaram, “Approximating warps with
intra-warp operand value similarity,” in 2016 IEEE International Sympo-
sium on High Performance Computer Architecture (HPCA). IEEE, 2016,
pp. 176–187.
WON JEON received the B.S. degree in electrical
and electronic engineering from Yonsei Univer-
sity, Seoul, Korea, in 2014. He is currently working toward the Ph.D. degree with the Embedded Systems and Computer Architecture Laboratory, School of Electrical and Electronic Engineering, Yonsei University. His current research interests
are GPU memory systems, non-volatile memory,
processing-in-memory architecture designs, and
approximate computing for neural network appli-
cations. He is a student member of the IEEE.
JUN HYUN PARK received the B.S. degree in
electrical and electronic engineering from Chung-
Ang University, Seoul, Korea, in 2018 and the
M.S. degree in electrical and electronic engineer-
ing from Yonsei University, Seoul, Korea, in 2020.
He is currently working as an IP design engineer
with the System LSI Division, Samsung Electronics.
His current research interests are interconnection
technology in SoC and GPU architecture.
YOONSOO KIM received the B.S. and M.S. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2016 and 2018, respectively. He is currently working as a system engineer with the NAND Solution Division, SK Hynix. His current research interests are solid-state drive functionality and GPU architecture.
GUNJAE KOO received the B.S. and M.S. degrees in electrical and computer engineering from Seoul National University in 2001 and 2003, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California in 2018. He is currently an assistant professor with the Department of Computer Science and Engineering, Korea University. His research interest is in the
general area of computer system architecture and
spans parallel processor architecture, storage and
memory systems, accelerators, and secure processor architecture. Prior to
joining Korea University, he was an assistant professor at Hongik Univer-
sity. His industry experience includes a senior research engineer with LG
Electronics and a research intern with Intel. He is a member of the IEEE and
the ACM.
WON WOO RO received the B.S. degree in elec-
trical engineering from Yonsei University, Seoul,
Korea, in 1996, and the M.S. and Ph.D. degrees
in electrical engineering from the University of
Southern California, in 1999 and 2004, respec-
tively. He worked as a research scientist with
the Electrical Engineering and Computer Science
Department, University of California, Irvine. He
currently works as a professor with the School
of Electrical and Electronic Engineering, Yonsei
University. Prior to joining Yonsei University, he worked as an assistant
professor with the Department of Electrical and Computer Engineering,
California State University, Northridge. His industry experience includes a
college internship with Apple Computer, Inc. and a contract software engi-
neer with ARM, Inc. His current research interests include high-performance
microprocessor design, GPU microarchitectures, neural network accelera-
tors, and memory hierarchy design. He is a senior member of the IEEE.