Hi-End: Hierarchical, Endurance-Aware STT-MRAM-Based Register File for Energy-Efficient GPUs
WON JEON1, (Student Member, IEEE), JUN HYUN PARK1, YOONSOO KIM2, GUNJAE KOO3,
(Member, IEEE), AND WON WOO RO1, (Senior Member, IEEE)
1Electrical and Electronic Engineering Department, Yonsei University, Seoul, South Korea
2SSD System Engineering Team, NAND Solution Division, SK Hynix, Seongnam, South Korea
3Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Corresponding author: Won Woo Ro (e-mail: wro@yonsei.ac.kr).
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.
NRF-2018R1A2A2A05018941) and Institute of Information & Communication Technology Planning & Evaluation (IITP) grant funded by
the Korea government (MSIT) (No. 2019-0-00533, Research on CPU vulnerability detection and validation).
ABSTRACT Modern Graphics Processing Units (GPUs) require large hardware resources for massive
parallel thread executions. In particular, modern GPUs have a large register file composed of Static
Random Access Memory (SRAM). Due to the high leakage current of SRAM, the register file consumes
approximately 20% of the total GPU energy. The energy efficiency of the register file becomes more
critical as the throughput of GPUs increases. For more energy-efficient GPUs, the usage of non-volatile
memory such as Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) as the GPU
register file has been studied extensively. STT-MRAM requires a lower leakage current compared to
SRAM and provides an appropriate read performance. However, using STT-MRAM directly in the GPU
register file causes problems in performance and endurance because of complicated write procedures and
material characteristics. To overcome these challenges, we propose a novel register file architecture and its
management system for GPUs, named Hi-End, which exploits the data locality and compressibility of the
register file. For STT-MRAM-based GPU register files, Hi-End increases the data write performance and
endurance by caching and data compression, respectively. In our evaluation, Hi-End enhances the energy
efficiency of a GPU register file by 70.02% and reduces the write operations by up to 95.98% with negligible
performance degradation compared to SRAM-based register files.
INDEX TERMS Graphics processing unit, register file, spin-transfer torque magnetic random access
memory, data compression, energy efficiency, endurance, chip area.
I. INTRODUCTION
Graphics Processing Units (GPUs) have emerged as the most important computing platform since throughput-oriented applications, such as artificial intelligence workloads,
became popular. Many emerging applications, such as image
classification, audio synthesis, and recommender systems,
are accelerated using GPUs. Because modern GPUs are
deployed from mobile systems to high-performance data
centers, the power consumption of GPUs becomes a critical
issue. A GPU core, called a streaming multiprocessor (SM),
provides a large register file for massive thread-level paral-
lelism and fast context switching. For instance, the recent
NVIDIA Ampere architecture includes a 27,648 KB register
file whereas the previous Fermi architecture has a 2,048 KB
register file [1], [2]. To provide sufficient on-chip memory
performance, the register file in a GPU is implemented with
Static Random Access Memory (SRAM) which requires
a significant amount of power and area. Previous studies
revealed that the GPU register file consumes 15–20% of
the total energy consumption on NVIDIA GPU devices [3],
[4]. As a modern GPU provides a larger register file, the
power consumption and the area overhead by the register file
become more critical.
As Complementary Metal–Oxide–Semiconductor (CMOS)
scales down, the static power consumption by leakage
current occupies a considerable proportion of the entire
power budget. To reduce the static power of memory cells,
researchers have investigated the Non-Volatile Memory
(NVM) technologies such as Spin-Transfer Torque Magnetic
Random Access Memory (STT-MRAM). The STT-MRAM
cells provide similar read performance and a significantly
low leakage power compared to the conventional SRAM
cells. However, STT-MRAM cells suffer from inferior write performance and endurance. The long write latency
and the limited write endurance of STT-MRAM are critical
obstacles preventing it from directly substituting SRAM cells
despite its leakage power advantage. Previous studies pro-
posed architectural solutions to overcome these performance
hurdles of STT-MRAM [4]–[13]. For instance, SRAM-based
write buffers have been utilized to minimize the performance
degradation from STT-MRAM’s long write latency [4], [5].
However, despite their importance, none of the previous
studies have considered the low endurance issues of STT-
MRAM cells used in place of SRAM cells.
In this paper, we propose a novel STT-MRAM-based
register file architecture for GPUs, named Hierarchical,
Endurance-aware STT-MRAM-based register file (Hi-End).
Hi-End employs a hierarchical structure and a compression
technique to solve the problems of STT-MRAM. We de-
signed Hi-End based on two observations. The first obser-
vation is that most read/write accesses to the GPU register
file are concentrated to a small number of registers. Hence,
the GPU register file accesses exhibit a high locality. The
second observation is that most write accesses to the GPU
register file are compressible. Hence, the number of register file bank activations can be effectively reduced if the data are compressed.
To fully exploit the locality of the GPU register file, we
adopted a register cache and a delay buffer. The register
cache works as a write cache for the slow STT-MRAM
register file. The operation of the register cache is similar to
that of the conventional cache except it only caches write-
back data. When the cache block is evicted from the register
cache, the delay buffer operates as a station-like buffer that
maintains the evicted block until it is written to the STT-
MRAM register file. When evicted data are written to register
files, data are compressed to minimize the dynamic power
and increase the endurance of STT-MRAM. Since the data
are compressed, we can store the same data with a small
number of register banks. Hi-End utilizes unused register
banks to enhance the endurance of STT-MRAM cells. Parts
of the STT-MRAM register banks are accessed sequentially
to write data. Consequently, the number of write operations
for each STT-MRAM cell is decreased owing to the compres-
sion ratio; hence, endurance is improved accordingly.
The contributions of this paper are summarized as follows:
• By exploiting the locality of the GPU register file, we propose a register cache and a delay buffer architecture to minimize direct accesses to the STT-MRAM-based register file. Using the register cache and the delay buffer, we can reduce the STT-MRAM write overhead while retaining the advantages of the STT-MRAM register file.
FIGURE 1. Baseline GPU and register file architecture (SMs with I-cache/fetch/decode, warp scheduler, register file, CUDA cores, load/store unit, L1 data cache, and scratchpad memory; a shared L2 cache and memory controllers to off-chip DRAM; the register file comprises multiple banks behind a bank arbiter and operand collectors).
• We propose a compression technique that exploits the value similarity of the GPU register file and decreases the high dynamic power of the STT-MRAM register file. Furthermore, bank-level wear-leveling spreads out the otherwise concentrated bank accesses to increase endurance with a low-cost architecture.
• With cycle-accurate simulations, we show that Hi-End reduces the energy consumption of GPU register files by 70.02% with a 0.86% performance drop compared to GPUs with SRAM-based register files. Hi-End also reduces the number of write operations to the most frequently written register file bank by 95.98%.
The rest of this paper is organized as follows. Section II
presents the baseline GPU architecture, register file orga-
nization, and characteristics of STT-MRAM. Section III
introduces the motivational data and opportunities for our
study. In Section IV, we present the detailed architecture and
operation of Hi-End. The experimental results of Hi-End,
such as the performance and energy efficiency, are described
in Section V. In Section VI, we introduce related studies, and
we conclude our paper in Section VII.
II. BACKGROUND
A. BASELINE GPU ARCHITECTURE
In this paper, we use NVIDIA GPU terminologies for con-
sistency [1], [2], [14]–[18]. Our baseline GPU model com-
prises several SMs, and each SM contains dozens of cores;
therefore, the GPU device contains hundreds to thousands of
computing cores. An SM can execute thousands of threads
at a time (e.g., 2,048 in recent GPUs). To provide sufficient memory bandwidth to the massive number of concurrently executing threads, GPUs employ large register files. Consequently, the register file occupies a large portion of the entire on-chip memory in GPUs.
Fig. 1 describes the baseline GPU and the register file
architecture used in this work. The on-chip memory systems
of GPUs have hierarchical structures with a register file, L1
data cache, scratchpad memory (or shared memory), and L2
shared cache (or last-level cache).

FIGURE 2. Energy consumption of the SRAM-based GPU register file (energy consumption ratio per benchmark, split into leakage, dynamic read, and dynamic write energy).

The register file holds instruction operands and provides additional memory space for context switching. The L1 data cache and the scratchpad
memory are located inside of each SM and provide private
cache memory space. In particular, the scratchpad memory
stores user-defined data for higher throughput. The memory space for the L1 data cache and the scratchpad memory is often exchangeable. Finally, the L2 shared cache is located
outside of the SMs and provides shared memory space for
all threads operating on a GPU. The L2 shared cache is
connected with an off-chip main memory system through
memory controllers.
B. GPU REGISTER FILE ARCHITECTURE
Conventional GPUs provide large SRAM-based register files.
As shown in Fig. 1, the register file in a single SM is
composed of multiple banks in order to provide parallel bank
access and higher bandwidth. The baseline GPU model used
in this study employs a 128 KB register file composed of 64
banks. Each register file bank contains 256 entries, and each
entry is 64 bits wide; hence, a single entry stores two 32-bit values. GPU hardware executes multiple threads in a group,
known as a warp, which comprises 32 threads that share
the same program counter and execute identical instructions
simultaneously. To access multiple registers concurrently, the
registers for 32 threads in the same warp are allocated to the
same entry location that exists across multiple banks [19]. For
instance, assuming the size of a register r0 is 4 bytes, r0 of
32 threads in warp0 can be stored in entry0 of register
banks from 0 to 15. Hence, multiple threads in a single warp
access tens of registers simultaneously to minimize clock
cycles required to access the register file. However, a large
amount of dynamic and static power is dissipated in the
register file, as a single warp instruction can activate dozens
of register file banks [19].
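To make the layout concrete, the sketch below maps a warp register to banks and entries under the organization above (64 banks, 256 entries per bank, 64-bit entries). The striding policy in warp_register_location() is an assumption for illustration; the actual mapping used by real GPUs is not public.

```python
# Minimal sketch of one plausible warp-register-to-bank mapping,
# assuming the 64-bank, 256-entry, 64-bit-entry organization above.
# The striding policy is illustrative, not the documented GPU hash.

NUM_BANKS = 64
BANKS_PER_WARP_REG = 16   # 32 threads x 4 B = 128 B, over 8 B entries

def warp_register_location(warp_id: int, reg_id: int):
    """Return (bank_ids, entry_index) for register reg_id of a warp."""
    first_bank = (warp_id * BANKS_PER_WARP_REG) % NUM_BANKS
    banks = [(first_bank + i) % NUM_BANKS for i in range(BANKS_PER_WARP_REG)]
    entry = reg_id % 256              # one entry row per register (assumed)
    return banks, entry

# Example from the text: r0 of warp0 lands in entry 0 of banks 0-15.
assert warp_register_location(0, 0) == (list(range(16)), 0)
```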
C. CHARACTERISTICS OF STT-MRAM
High leakage current in SRAM-based memory cells is the
major source of large power dissipation in the GPU register
file. As shown in Fig. 2, approximately 59.6% of the total
GPU register file energy is consumed by the leakage current
in the SRAM cells under 32 nm CMOS technology. De-
tailed simulation methodologies for the figure are provided
in Section V. Using the emerging NVM cells is a prominent
approach, as most NVM technologies exhibit extremely low
leakage current. For instance, STT-MRAM provides a similar
read latency with a smaller cell size and an extremely low
leakage current compared to SRAM-based memory cells.
However, NVM technologies still have drawbacks in terms of
write performance and endurance. Other NVM technologies
such as Phase Change Memory (PCM) and Resistive Ran-
dom Access Memory (ReRAM) exhibit drawbacks, such as
extremely low endurance or long access latency; therefore,
they are not suitable for use as fast on-chip memories [20]. In
this work, we use STT-MRAM to build an energy-efficient
register file for GPUs.
TABLE 1. Characteristics of SRAM and STT-MRAM [4], [5], [21]

Parameter                 SRAM     STT-MRAM
Cell factor (F^2)         146      57.5
Area (mm^2)               0.194    0.038
Read latency (cycles)     1        1
Write latency (cycles)    1        4
Read energy (pJ/bit)      0.203    0.239
Write energy (pJ/bit)     0.191    0.300
Leakage power (mW)        248.7    16.2
Endurance (writes)        10^16    10^13
We compare the characteristics of STT-MRAM with the
conventional SRAM cells in Table 1. The parameters are
measured using NVSim [4], [5], [21]. More detailed ex-
perimental methodologies are provided in Section V. The
endurance parameter of STT-MRAM is configured based on
the actual usage environment [22]. Even though STT-MRAM
exhibits the same read latency, the longer write access latency
(four times slower) and higher write energy consumption are
critical disadvantages of STT-MRAM cells. Another critical
drawback of STT-MRAM is its low write endurance, which is
1,000 times lower compared to that of SRAM. In particular,
the register file is the memory space closest to the computational units; therefore, data reads and writes occur there most frequently. Consequently, the low write endurance of
STT-MRAM results in the short lifetime of the entire GPU
system.
To investigate the effect of STT-MRAM-based register
files on GPU performance and lifetime, we configured the
STT-MRAM parameters listed in Table 1 into a cycle-
accurate GPU simulator [23]. In the study, we replaced the
SRAM-based register file in the GPU with an STT-MRAM-
based register file without any additional techniques. Our
simulation study reveals that the performance decreases by
18% due to slower register file write operations, and the GPU
lifetime becomes only 11 months due to the limited register
file write endurance. Both the performance drop and short
lifetime are impractical for GPUs.
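As a sanity check on the 11-month figure (our own back-of-the-envelope reading, not a calculation reported above), the lifetime implied by the 10^13-write endurance of Table 1 follows from an assumed per-cell write rate r_write:

```latex
\text{lifetime} \approx \frac{E_{\text{endurance}}}{r_{\text{write}}}
\quad\Longrightarrow\quad
r_{\text{write}} \approx \frac{10^{13}\ \text{writes}}{2.9\times10^{7}\ \text{s}\ (\approx 11\ \text{months})}
\approx 3.5\times10^{5}\ \text{writes/s}
```

That is, a sustained write rate on the order of a few hundred thousand writes per second to the most heavily written cell is enough to wear it out in under a year.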
FIGURE 3. On-chip memory size of NVIDIA GPUs by generation (register file, L1D + scratchpad, and L2 sizes, with the register file portion of the total, for Fermi GF100 through Ampere GA100).

FIGURE 4. On-chip memory size per shader unit (KB) for Fermi through Ampere (L1 data cache + scratchpad memory, L2 cache, and register file).
III. MOTIVATION
As introduced in the previous section, low write performance
and endurance are critical obstacles hindering the usage of
STT-MRAM cells as on-chip memory structures, such as
register files and L1 caches. In this section, we address the
high energy consumption problem of the GPU register file and
analyze the unique patterns of register file read and write
operations in GPU applications. In addition, we analyze the
similarities in the register data accessed by adjacent threads
within the same warp. Furthermore, we demonstrate that
the register file data can be effectively compressed to a
smaller size by exploiting the data similarity observed in the
register data. Such observations motivate the development
of architectural solutions that can employ STT-MRAM to
achieve an energy-efficient register file design.
A. ENERGY CONSUMPTION OF GPU REGISTER FILE
To achieve a high throughput, GPUs execute hundreds of
thousands of threads concurrently. For example, the recent
NVIDIA Ampere GA100 executes up to 221,184 threads
in one GPU device [2]. The large register file has been a key enabler for the high throughput of GPUs.
We analyzed the on-chip memory size and a portion of the
register file for NVIDIA GPUs over the last decade [1], [2],
[14]–[18]. As shown in Fig. 3, the register file constituted
approximately 60% of the total on-chip memory. Despite
SRAM’s high energy consumption, the register file occupies
a large portion of the on-chip memory system.

FIGURE 5. Read and write ratios of the Top-5 most accessed registers (share of all accesses per benchmark, for the five most written and five most read register IDs).
The relatively low register file portion of Ampere (approximately 30.1%) is due to a substantial increase in the L2 cache. The L2 caches of NVIDIA GPUs have grown to serve workloads whose datasets cannot be cached effectively at conventional L2 capacities, such as artificial intelligence and high-performance computing workloads.
The register file size of NVIDIA GPUs has increased
steadily. Fig. 4 shows the on-chip memory size per shader
unit (or CUDA core). Except for Ampere's unusually large L2 cache,
the register file size per shader unit is always the largest
among various on-chip memory structures. In fact, the regis-
ter file size per shader unit has remained at 4 KB since Pascal
GPUs. The number of computing cores in a GPU device will
continue to increase to achieve higher throughput. Hence, the
register file size is expected to increase accordingly. How-
ever, large SRAM-based register files consume a substantial
amount of static energy, as presented in Section II-C. As
a result, the energy consumption of register files in GPUs
becomes an increasingly important issue.
B. LOCALITY IN GPU REGISTER FILE
We discovered that the register file accesses possess locality
based on the register IDs. To reveal this register locality,
we measured the read and write access counts by register
IDs using a cycle-accurate GPU simulator [23]. Detailed
simulation methodology is provided in Section V. For in-
stance, if an instruction requires r0 and r1 as operands
and writes the execution result to r2, we increment the
read counts of r0 and r1 and the write count of r2 by
1, respectively. Subsequently, we sorted the register access
counts based on the register IDs. This measurement reveals
two important characteristics in GPU register file accesses.
First, most register file accesses are concentrated on several
register IDs. On average, more than 80% of register write
accesses are from the Top-5 most written register IDs as
shown in Fig. 5. Second, the write access counts from these
register IDs are similar to the read counts. This study reveals
that high locality and concentration are observed in GPU
register file accesses.
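The counting procedure can be summarized with a short sketch; the simplified (dst, srcs) instruction form and helper names are ours, not the simulator's API.

```python
from collections import Counter

# Sketch of the per-register-ID access counting described above.
read_counts, write_counts = Counter(), Counter()

def record(dst, srcs):
    """E.g., an add writing r2 from r0 and r1: record('r2', ['r0', 'r1'])."""
    for r in srcs:
        read_counts[r] += 1
    write_counts[dst] += 1

record('r2', ['r0', 'r1'])   # the example instruction from the text

def top5_write_share():
    """Share of writes going to the Top-5 most written register IDs;
    the measurement above reports more than 80% on average (Fig. 5)."""
    total = sum(write_counts.values())
    return sum(c for _, c in write_counts.most_common(5)) / total
```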
FIGURE 6. Hi-End architecture (the baseline pipeline of warp scheduler, operand collectors, bank arbiter, crossbar, and SIMD execution units, extended with the new Hi-End structures: register cache, delay buffer, compression and decompression units, a compression indicator with bank-level wear-leveling in the bank arbiter, and 64 STT-MRAM register banks of 256 x 64-bit entries).
C. DATA SIMILARITY IN GPU REGISTER FILE
The other characteristic observed in the GPU register file data is that the data written by adjacent threads within the same warp have similar values. Hence, this type of data
can be easily compressed using simple compression methods,
such as the Base-Delta-Immediate (BDI) algorithm [24]. The
BDI compression method can perform fast and simple com-
pression and decompression processes. A BDI compressor
extracts two parameters, called the base and the deltas, from a set of values. The data to be compressed are split into multiple pieces, and the first piece is set as the base. Subsequently, the difference from the base is computed for each of the other pieces, and these differences are stored as deltas. For
instance, 128-byte register data composed of 32 4-byte values can be compressed to 35 bytes when BDI compression is applied with a 4-byte base and a 1-byte delta resolution. Furthermore, when all 32 values are identical, the 128-byte register data can be compressed to 4 bytes, as the delta width is 0 bytes. Hence, by applying BDI compression, read/write accesses to multiple memory banks can be reduced.
The original BDI algorithm exploits multiple bases and
corresponding deltas to achieve optimal compression ratio.
However, in this study, we fixed the base resolution to 4-byte
to simplify the hardware design. Furthermore, we restricted
the deltas to only have 0-byte, 1-byte, and 2-byte resolutions.
By applying these limitations, we can minimize the clock cycles
and power consumption of the compression and decompres-
sion units. In this work, if the data cannot be compressed us-
ing such fixed base and delta resolutions, the data will remain
uncompressed. To demonstrate the value similarity of the GPU register file, we calculated the ratio of register writes that can be compressed using the BDI algorithm: in fact, more than 62% of register writes are compressible in this way. Furthermore, recent studies revealed that the
register file can be effectively compressed using the BDI
compression algorithm [4], [19].
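A minimal, runnable sketch of this restricted BDI variant follows (fixed 4-byte base; 0-, 1-, or 2-byte deltas). The function name and the byte-level packing are our assumptions, since the hardware operates on bit fields rather than Python byte strings.

```python
import struct

def bdi_compress(data: bytes):
    """BDI with a fixed 4-byte base and 0/1/2-byte deltas, as in the
    restricted configuration above. Returns (delta_bytes, payload),
    or None when the data must be stored uncompressed."""
    assert len(data) == 128                       # one 32-thread warp register
    words = struct.unpack('<32i', data)           # 32 signed 4-byte values
    base, deltas = words[0], [w - words[0] for w in words[1:]]

    if all(d == 0 for d in deltas):               # 0-byte delta: base only
        return 0, struct.pack('<i', base)         # 4 bytes total
    for nbytes, fmt in ((1, '<b'), (2, '<h')):
        bound = 1 << (8 * nbytes - 1)
        if all(-bound <= d < bound for d in deltas):
            payload = struct.pack('<i', base)
            payload += b''.join(struct.pack(fmt, d) for d in deltas)
            return nbytes, payload                # 4 + 31 * nbytes bytes
    return None                                   # incompressible

# 32 identical values -> 4 bytes; base plus small deltas -> 35 bytes.
assert len(bdi_compress(struct.pack('<32i', *[7] * 32))[1]) == 4
assert len(bdi_compress(struct.pack('<32i', *range(100, 132)))[1]) == 35
```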
IV. HIERARCHICAL, ENDURANCE-AWARE STT-MRAM
REGISTER FILE ARCHITECTURE (HI-END)
We propose an energy-efficient STT-MRAM-based register
file for GPUs. Our proposed register file architecture exploits
STT-MRAM as a primary memory unit for register data
to reduce the static power dissipation. To overcome STT-
MRAM’s poor write performance, we apply a hierarchical
structure that employs a small SRAM as a cache for the
STT-MRAM register file. It is effective because of the strong
locality observed in the GPU’s register file data, as men-
tioned in Section III-B. In addition, we propose an efficient
wear-leveling mechanism using register data compression
to extend the lifetime of STT-MRAM cells. The detailed
architecture will be described in the following subsections.
A. OVERALL ARCHITECTURE OF HI-END
In this section, we describe the detailed architecture of the
proposed STT-MRAM-based hierarchical, endurance-aware
register file, called Hi-End. The overall architecture of Hi-
End is depicted in Fig. 6. Hi-End includes additional com-
ponents (register cache, delay buffer, compression unit, and
bank-level wear-leveling) on the baseline GPU register file.
The newly added blocks are shown in gray in the figure. Hi-
End employs a hierarchical structure that exploits an SRAM-
based write cache to hide the long write latency of the STT-
MRAM register file and reduce write counts to the STT-
MRAM cells. The SRAM-based write cache works as a
gateway for register writes to the STT-MRAM register file.
Hence, for register writes, the register values are first written
to the register cache. The GPU’s register data exhibits strong
locality (see Section III-B); therefore, the frequently written
register data can remain for a long time in the register cache.
Once the cached data are evicted from the register cache, the
evicted data are compressed using the compression unit. As
the size of the register data is reduced by the compression
step, the register data can occupy fewer banks of the STT-
MRAM register file.

FIGURE 7. Read and write operations in Hi-End. (a) A read request searches the register cache, then the delay buffer on a miss, and finally the register file. (b) A write request allocates in the register cache; an evicted block passes through the delay buffer before being written to the register file.

Furthermore, Hi-End applies bank-level
wear-leveling (BWL) to increase the lifetime of the STT-
MRAM cells. BWL equally distributes the write requests
to multiple banks of the STT-MRAM register file. Hi-End
employs the delay buffer between the register cache and the
bank arbiter to compensate the additional cycles from the
compression unit and slow STT-MRAM write operation. We
will describe the detailed structures and mechanisms of Hi-
End components in the following sections.
Hierarchical register file: Because Hi-End exploits the
hierarchical register file, the requested register data can be
found in the various levels of the memory units (register
cache, delay buffer, and STT-MRAM register file). Fig. 7
summarizes the read and write operations in Hi-End.
The GPU utilizes the operand collector to read register
data from the unified register file. When a warp issues an
instruction that accesses register data, the warp scheduler
passes operand information required for register file accesses
(see Fig. 6). Each operand collector contains the operand
information, such as warp IDs, valid flags, register IDs, and
ready flags. Such operand information is sent to Hi-End to access register data from the hierarchical register file.
For a read operation, Hi-End first searches the requested
data in the register cache. Since the register cache is a fast SRAM-based structure, Hi-End can quickly service the register data to the operand collector if
the requested data is found in the register cache. Second, the
requested data can be found in the delay buffer if the data is
evicted from the register cache and not written completely
to the STT-MRAM register file. When the requested data
does not exist in both the register cache and the delay buffer,
the data is read from the STT-MRAM register file. In this
case, if the data is stored in compressed form, the bank arbiter activates and accesses a smaller number of banks. The operand collector then obtains the original data via the decompression unit. Due to the hierarchical structure, the
register read latency can degrade the GPU performance when
the target data is in the STT-MRAM register file. However,
as most read requests are handled in the register cache, the
read delay is negligible in Hi-End. A detailed analysis of the effect of read latency on GPU performance is presented in Section V-C.

FIGURE 8. Register cache structure (direct-mapped, 256 blocks; an 11-bit tag is formed from the 6-bit warp ID and the 5-bit register ID with a shifter and an adder, and the index is the tag value modulo 256).
Once the Single Instruction Multiple Data (SIMD) exe-
cution units complete arithmetic operations, a register write
request occurs. Hi-End first searches the register cache for the
write request, and when a cache hit occurs, the target cache
line is updated with the new value. Otherwise, the write-
back data is newly allocated in the register cache. The data
evicted from the register cache must be stored in the STT-
MRAM register file if the register cache is already full with
the data that was previously written back; hence, the evicted
data enters the delay buffer. Subsequently, the evicted data
is compressed by the compression unit and written to the
STT-MRAM register file. This mechanism is similar to the
conventional cache write-back policy with a store buffer, except that all evicted data are written to the slower STT-MRAM register file. Once the data is compressed, the
bank arbiter selects the banks to be activated for register write
operation. Hi-End applies the BWL technique in this step
to improve endurance of STT-MRAM cells. The long write
latency of STT-MRAM can cause read-after-write hazards
and pipeline stalls in SMs. In Hi-End, with the support of
the register cache and delay buffer, a register write request
instantly obtains a cache line in the register cache; hence, the
write delay is negligible.
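The read and write paths can be summarized in a short sketch. The dictionary-based structures and the eviction stub below are illustrative stand-ins: the real register cache is direct-mapped, and compression and cycle timing are elided.

```python
# Runnable sketch of the Hi-End read/write paths (Fig. 7). Plain
# dicts stand in for the three structures; the real register cache
# is direct-mapped, so the capacity-based eviction here is a stub.
register_cache, delay_buffer, sttmram_rf = {}, {}, {}
RC_BLOCKS = 256

def hiend_read(key):
    """key = (warp_id, reg_id); the search order mirrors Fig. 7(a)."""
    if key in register_cache:        # fast SRAM hit
        return register_cache[key]
    if key in delay_buffer:          # evicted block not yet written back
        return delay_buffer[key]
    return sttmram_rf.get(key)       # decompressed on the way out

def hiend_write(key, data):
    """All writes allocate in the register cache (Fig. 7(b)); a victim
    evicted from a full cache enters the delay buffer and is later
    compressed and written to the STT-MRAM register file."""
    if key not in register_cache and len(register_cache) >= RC_BLOCKS:
        victim, vdata = register_cache.popitem()   # stub eviction policy
        delay_buffer[victim] = vdata               # then compress + write
    register_cache[key] = data
```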
B. DETAILED ARCHITECTURE OF HI-END COMPONENT
In this subsection, we discuss the detailed architecture and
operations of each structure in Hi-End, such as the register
cache, delay buffer, compression unit, and bank-level wear-
leveling.
Register cache: The register cache is an SRAM-based,
fast, and temporary cache memory for write-back data. The
register cache employs a direct-mapped cache structure to
simplify the hardware complexity. While the write perfor-
mance of STT-MRAM is much slower than that of SRAM,
the read performance of STT-MRAM is similar to that of
SRAM. Using this characteristic, the register cache in Hi-End
only allocates write data, unlike the general cache memory.
A read request can be quickly serviced from the STT-MRAM
FIGURE 9. Delay buffer structure.
register file due to its read performance. Therefore, the read
request does not require the support of the register cache.
Furthermore, requesting the previously written-back register
data is more probable than reusing the data previously read
from the register file. Hence, the register cache effectively
increases the write hit ratio by allocating only write data.
Using this approach, Hi-End can reduce the number of write
requests to the STT-MRAM register file, thereby increasing
the lifetime of the STT-MRAM cells. In this work, the
register cache contains 256 blocks with 1024-bit cache lines. As shown in Fig. 8, a single cache line includes a 1-bit valid flag and an 11-bit tag field. Unlike conventional cache tags that use part of an address field, the tags in the register cache combine a 6-bit warp ID and a 5-bit register ID. The tag calculation logic is implemented with a logical shifter and an adder.
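A sketch of the tag and index calculation of Fig. 8 follows. Storing the full 11-bit combined value as the tag follows the figure, although an implementation could store only the bits not covered by the index.

```python
def register_cache_slot(warp_id: int, reg_id: int):
    """Tag calculation sketched in Fig. 8: an 11-bit key is formed
    from the 6-bit warp ID and the 5-bit register ID with a shifter
    and an adder; the index is the key modulo 256 (direct-mapped)."""
    assert warp_id < 64 and reg_id < 32
    key = (warp_id << 5) + reg_id        # 11-bit combined identifier
    index = key % 256                    # selects one of 256 blocks
    tag = key                            # stored and compared on lookup
    return index, tag

# warp 3, r7 -> key (3 << 5) + 7 = 103 -> index 103, tag 103
assert register_cache_slot(3, 7) == (103, 103)
```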
Delay buffer: The role of the delay buffer is similar to
that of the store buffer in the conventional cache architecture.
The delay buffer temporarily stores the data evicted from the register cache before they are written to the STT-MRAM register file. This buffer space is required because several cycles are needed for data compression and for the write operation to the STT-MRAM cells. Note that
the write performance of STT-MRAM used in this work is
four times slower than that of SRAM cells, as explained in
Section II-C. In this work, we configure the delay buffer
to include 16 entries, as shown in Fig. 9. Our simulation
study reveals that this depth is sufficient to minimize the
structural hazards in the delay buffer considering the cycles
required by the compression unit and the STT-MRAM writes.
As mentioned previously, read requests can be served from the delay buffer. This feature allows Hi-End to read
the data while they are being compressed, thereby hiding
compression latency. In addition, if the delay buffer did not support read operations, the operand collector could obtain incorrect data from the STT-MRAM register file due to the long write delay. This is because, between the time data leaves the register cache and the time it is written to the STT-MRAM register file, no corresponding entry exists in the register cache, and only stale data exists in the STT-MRAM register file.
FIGURE 10. Implementation of the compression unit [19] (32-bit subtractors compute per-word deltas from the 4-byte base; sign-extension comparators test compressibility; compressible data is packed, otherwise the original data is emitted uncompressed).

To support the read operation of the delay buffer, each entry of the delay buffer has additional control fields, as shown
in Fig. 9. These control fields include a 1-bit valid flag, 1-bit
allocate flag, 3-bit counter, 5-bit register ID, 6-bit warp ID,
and 1024-bit register data. While other fields contain general
information, the allocate flag and the counter contain unique
information for STT-MRAM. These two fields are used to
validate the completion of write operations. If a polling-based approach were used for this validation, every entry of the delay buffer and every STT-MRAM register bank would have to communicate with each other, and the completion of each write process would have to be verified every cycle. Hi-End employs a counter-based approach to avoid such overhead. The
allocate flag indicates whether the corresponding entry is
currently writing data into the STT-MRAM register file. The
counter field contains the number of cycles remaining to
complete the write process of the corresponding entry. The
counter is set to 6 when a write process is initiated and
decreased by 1 while the allocate flag is 1. In detail, writing
data to the STT-MRAM register file requires two cycles for
compression and four cycles of STT-MRAM write latency;
hence, six cycles are required in total, and the counter must
be set accordingly.
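A sketch of this counter-based completion tracking: the entry is freed six cycles after the write is initiated (two cycles of compression plus four cycles of STT-MRAM write latency). The class shape is ours; the field widths follow Fig. 9.

```python
COMPRESS_CYCLES, STT_WRITE_CYCLES = 2, 4   # from the text above

class DelayBufferEntry:
    """Counter-based write-completion tracking, a sketch of the
    mechanism described above; field widths follow Fig. 9."""
    def __init__(self, warp_id, reg_id, data):
        self.valid, self.allocate = 1, 0   # 1-bit flags
        self.warp_id, self.reg_id, self.data = warp_id, reg_id, data
        self.counter = 0                   # 3-bit down counter

    def start_write(self):
        self.allocate = 1
        self.counter = COMPRESS_CYCLES + STT_WRITE_CYCLES  # set to 6

    def tick(self):
        """Called once per cycle; reclaims the entry when the write
        to the STT-MRAM register file must have completed."""
        if self.allocate:
            self.counter -= 1
            if self.counter == 0:
                self.valid = self.allocate = 0   # entry freed
```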
Compression unit: We implement a compression unit
using the BDI algorithm used in warped-compression [19].
In the BDI algorithm, the compression ratio can be improved
by supporting various configurations for the base and delta.
However, a higher number of options for the base and delta
results in a higher hardware overhead; hence, we limit the
number of possible sizes of the base and delta. The size of the base is fixed at 4 bytes, whereas the size of the delta can be 0, 1, or 2 bytes. Based on our observations shown in
Section III, most GPU register file data can be efficiently
compressed using these three options for delta. Therefore, the
compression unit in Hi-End supports only three compression
options to simplify the hardware design. The hardware ar-
chitecture of the compression unit is described in Fig. 10.
FIGURE 11. Example of the bank-level wear-leveling operation in Hi-End compared to the baseline register file and warped-compression [19] (four warp register writes: (a) the baseline spreads each write across many banks; (b) warped-compression concentrates compressed writes on a few banks and power-gates the rest; (c) Hi-End rotates compressed writes across the banks).
Hi-End can reduce the dynamic power for accessing the
STT-MRAM register file by compressing the data size. In
addition, we can reduce the number of activated banks in the
STT-MRAM register file when storing compressed data. For
instance, the compressed data will activate only 5 banks if
the data size is reduced by a 4-byte base and a 1-byte delta,
whereas the uncompressed 128-byte data accesses 16 banks
in the register file. Hence, Hi-End can effectively reduce the
access counts to the STT-MRAM cells and, with the support of BWL, prolong the lifetime of the register file.
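The bank count follows directly from the 64-bit (8-byte) bank width; for the example above:

```latex
N_{\text{banks}} = \left\lceil \frac{\text{data size}}{8\ \text{B}} \right\rceil,
\qquad
\left\lceil \tfrac{35}{8} \right\rceil = 5 \ \text{(compressed)}
\quad\text{vs.}\quad
\left\lceil \tfrac{128}{8} \right\rceil = 16 \ \text{(uncompressed)}.
```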
Bank-level wear-leveling: To prolong the lifetime of
the STT-MRAM cells, Hi-End applies BWL by exploiting
the advantages of the compressed data. Hi-End can reduce
the overall write counts for the STT-MRAM banks using
a hierarchical structure. However, some register banks may
have a shorter lifetime if the compressed data is written more
frequently to several specific banks. To avoid such cases, Hi-
End applies the wear-leveling mechanism, which can balance
the write accesses among multiple register file banks.
Fig. 11 presents an illustrative example of BWL of Hi-End
compared to the baseline register file without compression
and warped-compression [19]. Without using the compres-
sion technique, each register data is written across multiple
banks, as shown in Fig. 11(a). In the baseline model, all
banks are affected by the high write count, which results in
a short lifetime when the register file is implemented with
STT-MRAM. Warped-compression compresses register data,
uses a smaller number of banks, and power-gates the unused banks
to reduce leakage power, as shown in Fig. 11(b). However,
in case of STT-MRAM, power gating has a less prominent
effect compared to the case of SRAM because the leakage
power of STT-MRAM is extremely small. Furthermore, al-
though the lifetime of the unused banks can be prolonged, the
banks that frequently store compressed data are still affected by high write counts and short lifetimes. The lifetime of a
device is determined by the structure with the shortest life-
time. Consequently, the lifetime of the STT-MRAM-based
register file with warped-compression is equivalent to that
of the baseline register file. As shown in Fig. 11(c), Hi-
End compresses register data and stores them into multiple
banks to avoid the write concentration that occurs in warped-compression. Using the wear-leveling technique, Hi-End can decrease the write concentration and improve the lifetime of the GPU device.

TABLE 2. Simulation Parameters

GPU Microarchitectural Parameters
Process: 32 nm
Core clock frequency: 700 MHz
SMs/GPU: 15
Warp size: 32 threads
Warp scheduling policy: Greedy-Then-Oldest
Register file size: 128 KB
Number of register banks/SM: 64
Maximum number of warps/SM: 48
Maximum number of threads/SM: 1,536
Maximum number of registers/SM: 32,768
Bit width/bank: 64-bit
Number of entries/bank: 256

Hi-End Architecture Parameters
Number of blocks in register cache: 256
Register cache size: 32.375 KB
Number of entries in delay buffer: 16
Delay buffer size: 2.03 KB
Compression unit energy/activation: 23 pJ
Compression unit leakage power: 0.12 mW
Decompression unit energy/activation: 21 pJ
Decompression unit leakage power: 0.08 mW
The Compression Indicator Table (CIT) in the bank arbiter
contains a compression range that indicates the compression
method used in the previous compression operation. The ar-
biter determines the number of banks to access by referring to
the compression range because the number of banks to access
changes based on the compression method. Furthermore, the
CIT contains the Bank Point Value (BPV), which indicates
the next bank ID of previously compressed data accessed. In
the write bank allocation process, the bank arbiter sends a
BPV to the compression unit with the data. The compression
unit identifies the bank ID to start a write operation using the
BPV. While the compression unit writes data to the register
file, it concurrently updates the CIT with a new BPV and
compression range. In the read operation, the decompression
unit can identify the number of banks to read by referring
to the compression range, and it can identify the bank ID at which to start the access by subtracting the number of banks to access from the BPV.
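One plausible reading of this bookkeeping as a runnable sketch; the exact CIT layout is not specified above, so the per-register (start bank, bank count) pair below is our assumption.

```python
NUM_BANKS = 64

class BankArbiterCIT:
    """Sketch of the Compression Indicator Table (CIT) and Bank
    Point Value (BPV) bookkeeping described above; storing a
    (start bank, bank count) pair per register is one plausible
    reading of the paper's description."""
    def __init__(self):
        self.bpv = 0          # next bank ID at which to start a write
        self.cit = {}         # (warp_id, reg_id) -> (start, n_banks)

    def allocate_write(self, key, n_banks):
        """Rotate compressed writes across banks starting at the BPV."""
        start = self.bpv
        self.cit[key] = (start, n_banks)           # compression range
        self.bpv = (start + n_banks) % NUM_BANKS   # advance the pointer
        return [(start + i) % NUM_BANKS for i in range(n_banks)]

    def locate_read(self, key):
        """Recover the banks to activate for a read from the stored
        start bank and compression range."""
        start, n = self.cit[key]
        return [(start + i) % NUM_BANKS for i in range(n)]

arb = BankArbiterCIT()
assert arb.allocate_write(('w0', 'r0'), 5) == [0, 1, 2, 3, 4]
assert arb.allocate_write(('w0', 'r1'), 5) == [5, 6, 7, 8, 9]
assert arb.locate_read(('w0', 'r0')) == [0, 1, 2, 3, 4]
```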
V. EVALUATION
A. METHODOLOGY
We evaluated our proposed architecture using the cycle-accurate simulator GPGPU-Sim, version 3.2.2 [23]. The detailed GPU hardware parameters are shown in Table 2. The parameters of the STT-MRAM and SRAM register files shown in Table 1 are evaluated using the circuit-level simulator NVSim and configured from previous studies [4], [21]. The parameters of the compression/decompression unit are configured from a previous study [19].
The detailed parameters of Hi-End, such as the register cache, delay buffer, and compression/decompression unit, are shown in Table 2.

FIGURE 12. Normalized register file energy consumption (normalized to the SRAM baseline; SRAM, STT, STT+WB+BDI, and Hi-End per benchmark).

FIGURE 13. Normalized GPU IPC (normalized to the SRAM baseline; SRAM, STT, STT+WB+BDI, and Hi-End per benchmark).
We selected 17 benchmarks from Parboil [25], CUDA
Software Development Kit [26], Rodinia [27], Poly-
Bench [28], and GPGPU-Sim benchmark suite [23]. Using
the benchmarks and architectural configuration above, we
compared the following four different architectures:
• SRAM: GPU architecture with an SRAM register file.
• STT: GPU architecture with an STT-MRAM register file without additional techniques.
• STT+WB(Write Buffer)+BDI: an STT-MRAM-based register file that includes a centralized write buffer and BDI compression from the previous study [4].
• Hi-End: the proposed architecture, which includes a register cache, a delay buffer, and a compression/decompression unit with the BWL technique.
B. ENERGY EFFICIENCY
The normalized energy consumption of the register file is
shown in Fig. 12. Each value is normalized to the energy consumption of the baseline SRAM-based register file. Using STT-MRAM directly as the register file instead of SRAM reduces the register file energy consumption by 43.98%, but it causes a 17.36% degradation in the GPU
Instructions per Cycle (IPC), as shown in Fig. 13. In several benchmarks, such as 3DCONV, BFS, GS, LBM, and ND, STT shows relatively low energy consumption compared to other benchmarks. As presented in Fig. 2 in Section II, the SRAM-based register file energy in such benchmarks is largely consumed by SRAM leakage. In particular, the leakage energy in BFS and GS is over 80% of the total register file energy consumption. High leakage energy consumption can be effectively decreased by directly using STT-MRAM without additional techniques. Furthermore, as the leakage energy ratio increases, the read and write energy ratio decreases, which diminishes the benefit of the additional techniques in Hi-End. Consequently, STT shows similar or even the lowest energy consumption on such benchmarks.
In the previous work (STT+WB+BDI), both the write
buffer and the STT-MRAM register file handle read requests
simultaneously, thereby requiring additional dynamic energy.
Hi-End accesses the STT-MRAM register file only if the
read request cannot obtain the data in the upper hierarchy,
such as the register cache or the delay buffer; hence, the
process is more energy-efficient. Overall, Hi-End operates the register file in an energy-efficient manner owing to the following two factors. The first involves the lower leakage current of
STT-MRAM.

FIGURE 14. Read access ratio of Hi-End memory structures (register cache, STT-MRAM register file, and delay buffer, per benchmark).

Since STT-MRAM is an NVM and does not
require a constant power supply, it exhibits almost no leakage current compared to SRAM. The second involves the
usage of the compression unit. STT-MRAM consumes high
dynamic power for read and write operations, as presented
in Table 1. Meanwhile, Hi-End reduces the number of bits
for read and write operations by compressing data for STT-
MRAM, thereby reducing the dynamic power. As a result, Hi-End reduces the register file energy by 70.02%, whereas STT+WB+BDI reduces it by only 58.79% compared to the SRAM-based register file.
C. IPC PERFORMANCE
Fig. 13 shows the GPU IPC performance normalized to
the baseline SRAM-based register file. Directly using STT-MRAM as the register file instead of SRAM decreases the performance by 17.36% due to the long write
latency of STT-MRAM. In particular, BS, HW, MUM, and SC are affected more under STT in terms of both energy and performance. As presented in Fig. 2 in Section II, these applications spend a large portion of their register file energy on reads and writes, which means their numbers of read and write requests are larger than in other benchmarks. As the number of reads and writes increases, STT cannot fully absorb the in-flight requests due to its low write performance; hence, performance degrades. The previous work that adopts a WB
covers some of the problems and improves the IPC compared
to STT. However, merely using WB cannot fully hide the
long write latency of STT-MRAM; therefore, a performance
degradation of 8.12% is observed.
Fig. 14 shows the ratio of read access to the different
memory structures in Hi-End, such as the register cache,
STT-MRAM, and delay buffer. In Hi-End, 85.40% of the
read operations are served from the register cache, 14.45%
from the STT-MRAM-based register file, and 0.15% from
the delay buffer. Despite its low read access ratio, the delay
buffer is an essential component, as it prevents incorrect
data reads, as explained in Section IV-B.

FIGURE 15. Normalized access count of the most accessed register bank (STT-MRAM, Hi-End without BWL, and Hi-End, per benchmark; normalized to STT-MRAM).
As reads from the delay buffer and from the STT-MRAM with the decompression unit require two and four cycles, respectively, the average read latency is 1.43 cycles. By employing the register cache, Hi-End has a one-cycle write latency, as all write accesses are performed in the register cache. As a result, the overall increase is only 0.43 cycles of read latency, which can be hidden by warp scheduling or other short-latency-hiding techniques. Consequently, the evaluation results demonstrate that Hi-End achieves 99.14% of the performance of the SRAM-based register file.
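For reference, the 1.43-cycle figure is the access-ratio-weighted average, assuming a one-cycle register cache hit:

```latex
\bar{t}_{\text{read}} = 0.8540 \times 1 + 0.0015 \times 2 + 0.1445 \times 4
                     \approx 1.43\ \text{cycles}.
```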
D. ENDURANCE
Although the endurance of STT-MRAM is one of the most critical issues in using it, endurance was overlooked in previous studies [4]–[6]. In Hi-End, we improve the endurance of STT-MRAM using two architectural techniques: the register cache and BWL. Fig. 15 shows the
access count of the most accessed register bank normalized
to the directly used STT-MRAM register file. By absorbing writes in the register cache, Hi-End decreases write accesses to the STT-MRAM register file by 90.25% compared to using the STT-MRAM register file directly. The concentrated bank accesses, however, provide additional opportunities to improve the endurance of STT-MRAM. Using the BWL technique, Hi-End distributes the concentrated write accesses of compressed data, decreasing accesses to the most written bank by an additional 58.78% compared to Hi-End without BWL. As a result, Hi-End decreases write accesses to the most written bank by 95.98% compared to directly using the STT-MRAM register file, thereby enhancing endurance considerably.
E. AREA ANALYSIS
To analyze the area of Hi-End, we used NVSim for the
area simulation of the STT-MRAM-based register file,
register cache, and delay buffer [4], [21]. The compression/decompression units are evaluated based on a previous study [19]. Table 3 shows the relative area of Hi-End by structure compared to the baseline SRAM-based register file. Hi-End reduces the area by 8.19% compared to the
baseline SRAM-based register file despite the additional structures in Hi-End. The main reason for the area saving is the higher density of STT-MRAM compared to SRAM: a 128 KB STT-MRAM register file occupies only 19.59% of the area of a 128 KB SRAM register file. The register cache, delay buffer, compression/decompression units, and STT-MRAM register file occupy 30.55%, 5.59%, 36.08%, and 19.59% of the SRAM-based register file area, respectively, resulting in an 8.19% area reduction overall.

TABLE 3. Relative Area of Hi-End Compared to the SRAM-based Register File

Hi-End Component                  Area Ratio (%)
Compression/decompression unit    36.08
Register cache                    30.55
STT-MRAM register file            19.59
Delay buffer                      5.59
Area saving                       8.19
Total                             100
VI. RELATED WORK
Hi-End aims to provide higher energy efficiency, lower per-
formance degradation, and higher endurance for GPUs with
STT-MRAM-based register file. To the best of our knowl-
edge, this work is the first that investigated compositive
optimization techniques of register file cache, compression,
and wear-leveling for STT-MRAM-based register file. In this
section, we discuss previous techniques that are related to our
study, such as optimization techniques for GPUs with STT-
MRAM register file and GPU register files with compression
techniques.
Register file adopting STT-MRAM: Recent studies attempted to reduce the power consumption of the GPU register file by substituting STT-MRAM for SRAM. However, directly using STT-MRAM in a GPU register file instead of SRAM is challenging due to its drawbacks. Several previous works
attempted to solve the challenges of STT-MRAM with archi-
tectural approaches.
Li et al. proposed a hybrid STT-MRAM register file archi-
tecture [5]. They adopted bank-level write buffer and warp-
aware write back techniques to overcome the performance
drop of STT-MRAM-based register files. They employed two
SRAM buffers for every STT-MRAM register file. When
a warp is active, one SRAM buffer is used to write back
registers, and the next active warp uses the other SRAM
buffer. The two buffers provide services alternately between
the two warps, thereby hiding the long latency stall of STT-
MRAM. However, in this approach, the entire pipeline of the
GPU can be stalled when the active period of one warp is
not adequate to write back data from the SRAM write buffer
to the STT-MRAM register file. Furthermore, such per bank
write buffer designs cannot fully utilize SRAM resources
when the write requests are unbalanced.
To address the limitations of the previous approach [5],
Zhang et al. proposed centralized write buffer [4]. They
employed an SRAM buffer in the bank arbiter, and the buffer
is shared by all register banks. Furthermore, they adopted
the compression technique to decrease the dynamic power
consumption of STT-MRAM. However, in this work, the
centralized write buffer and the STT-MRAM register file
are concurrently accessed to minimize read latency. This
approach decreases the energy efficiency of the GPU register
file due to the redundant memory accesses. We compared the
effect of the centralized write buffer on the energy consump-
tion and performance with Hi-End in Section V. In addition,
unlike Hi-End, the endurance problem of STT-MRAM is not
considered in the previous studies.
Liu et al. proposed a Multi-Level Cell (MLC) STT-MRAM
register file for GPUs [6]. Using the MLC STT-MRAM
design, they achieve a larger storage density of register
file. Furthermore, they proposed an MLC-aware register file
remapping strategy and a warp rescheduling scheme to accommodate the different read and write performance of soft-bit and hard-bit operations in MLC STT-MRAM. That work focused on optimization techniques for MLC STT-MRAM, whereas our work uses a hierarchical approach and a wear-leveling technique. Consequently, their MLC-aware technique is orthogonal to Hi-End, and the two can be combined.
GPU register file with compression: Previous works utilized several compression techniques to minimize the static or dynamic power consumption of SRAM-based register files in GPUs.
Lee et al. proposed warped-compression, an SRAM-based GPU register file with the Base-Delta-Immediate (BDI) compression algorithm [19], [29]. They restricted the base and delta parameters and applied BDI compression directly to warp registers in an SRAM-based GPU register file. After values are compressed, a smaller number of register banks is used, and the unused register banks are power-gated; hence, the technique reduces the power consumption of GPUs. In addition to adopting the compression technique in an STT-MRAM-based GPU register file, we propose additional optimization techniques specific to STT-MRAM. Unlike SRAM, which benefits only from power saving, compression in STT-MRAM additionally improves endurance with the support of BWL in Hi-End. Moreover, the power saving from compression is more prominent in STT-MRAM than in SRAM, since STT-MRAM requires higher dynamic power.
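The core of the base-delta idea can be shown in a few lines. The sketch below compresses one warp register in the spirit of warped-compression [19], [29]; the full BDI scheme [24] supports multiple base and delta sizes, whereas the single base and 1-byte deltas here are simplifying assumptions.

def compress_warp_register(lanes, delta_bytes=1):
    # lanes: the 32 per-thread 32-bit values of one warp register.
    base = lanes[0]
    deltas = [v - base for v in lanes]
    limit = 1 << (8 * delta_bytes - 1)
    if all(-limit <= d < limit for d in deltas):
        # 4 bytes for the base plus one small delta per lane,
        # instead of 4 * 32 = 128 bytes uncompressed.
        return {"base": base, "deltas": deltas,
                "bytes": 4 + delta_bytes * len(lanes)}
    return None  # incompressible: stored uncompressed

def decompress_warp_register(packed):
    return [packed["base"] + d for d in packed["deltas"]]

# Consecutive per-thread addresses compress from 128 bytes to 36 bytes.
lanes = [0x1000 + 4 * i for i in range(32)]
packed = compress_warp_register(lanes)
assert packed is not None and decompress_warp_register(packed) == lanes

When fewer bytes are written, fewer banks are activated, which is the source of both the dynamic power saving and, in the STT-MRAM case, the reduced cell wear.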
Wong et al. proposed warp approximation, which stores a representative value for a warp using an approximation technique [30]. They demonstrated effects similar to those of warped-compression and reduced the number of bits read or written, thereby reducing the dynamic and static power of the SRAM-based register file. Their approach can further increase the compression ratio compared to warped-compression, as values that differ only slightly from the representative value can additionally be removed. Since the approximation-based compression technique is orthogonal to Hi-End, it can also be applied to our work.
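The representative-value idea reduces to a simple similarity test per warp. The sketch below is illustrative only, not Wong et al.'s mechanism [30]; the choice of lane 0 as the representative and the 1% threshold are assumptions.

def approximate_warp_register(lanes, max_rel_error=0.01):
    # Return a single representative if all lanes are close to it.
    rep = lanes[0]
    tolerance = abs(rep) * max_rel_error
    if all(abs(v - rep) <= tolerance for v in lanes):
        return rep      # one value replaces all 32 lane values
    return None         # lanes too dissimilar: keep them exact

# Values within 1% of the representative collapse to one stored value.
assert approximate_warp_register([100.0, 100.4, 99.7] + [100.0] * 29) == 100.0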
VII. CONCLUSION
With the scaling down of CMOS and the adoption of larger register files in the latest GPU devices, the leakage current of conventional SRAM-based register files has become a critical problem. Recent studies proposed using STT-MRAM as a register file because of its low leakage current and acceptable read performance. However, STT-MRAM-based GPU register files must be carefully designed due to write overheads and endurance problems. Various existing techniques have addressed these challenges, but the write latency and endurance issues were not completely resolved. Hence, to address these issues, we propose the Hi-End register file architecture for GPUs. In Hi-End, we adopt a register cache to exploit the locality in the register file and a delay buffer to cover evicted data. Moreover, we implement a compression unit to reduce the high dynamic power and employ BWL to address the concentrated accesses of compressed data. Our experimental results demonstrate that Hi-End reduces energy consumption by 70.02% with only a 0.86% performance degradation compared to an SRAM-based register file. Moreover, Hi-End removes 95.98% of the write accesses to the register banks by distributing concentrated bank accesses and reducing write accesses with the register cache.
ACKNOWLEDGMENT
Won Jeon and Jun Hyun Park contributed equally to this
work.
REFERENCES
[1] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture:
Fermi, 2009.
[2] ——, NVIDIA A100 Tensor Core GPU Architecture: Unprecedented Acceleration at Every Scale, 2020.
[3] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.
Aamodt, and V. J. Reddi, “Gpuwattch: enabling energy optimizations in
gpgpus,” in ACM SIGARCH Computer Architecture News, vol. 41, no. 3.
ACM, 2013, pp. 487–498.
[4] H. Zhang, X. Chen, N. Xiao, and F. Liu, “Architecting energy-efficient stt-
ram based register file on gpgpus via delta compression,” in Proceedings
of the 53rd Annual Design Automation Conference. ACM, 2016, p. 119.
[5] G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang,
“A stt-ram-based low-power hybrid register file for gpgpus,” in 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2015,
pp. 1–6.
[6] X. Liu, M. Mao, X. Bi, H. Li, and Y. Chen, “An efficient stt-ram-based
register file in gpu architectures,” in The 20th Asia and South Pacific
Design Automation Conference. IEEE, 2015, pp. 490–495.
[7] S. Mittal, “A survey of techniques for architecting and managing gpu
register file,” IEEE Transactions on Parallel and Distributed Systems,
vol. 28, no. 1, pp. 16–28, 2016.
[8] H. Zhang, X. Chen, N. Xiao, L. Wang, F. Liu, W. Chen, and Z. Chen,
“Shielding stt-ram based register files on gpus against read disturbance,”
ACM Journal on Emerging Technologies in Computing Systems (JETC),
vol. 13, no. 2, pp. 1–17, 2016.
[9] Z. Gong, K. Qiu, W. Chen, Y. Ni, Y. Xu, and J. Yang, “Redesigning
pipeline when architecting stt-ram as registers in rad-hard environment,” Sustainable Computing: Informatics and Systems, vol. 22, pp. 206–218,
2019.
[10] Q. Deng, Y. Zhang, M. Zhang, and J. Yang, “Towards warp-scheduler
friendly stt-ram/sram hybrid gpgpu register file design,” in 2017
IEEE/ACM International Conference on Computer-Aided Design (IC-
CAD). IEEE, 2017, pp. 736–742.
[11] Y. Ni, Z. Gong, W. Chen, C. Yang, and K. Qiu, “State-transition-aware
spilling heuristic for mlc stt-ram-based registers,” VLSI Design, vol. 2017,
2017.
[12] X. Guo, E. Ipek, and T. Soyata, “Resistive computation: avoiding
the power wall with low-leakage, stt-mram based computing,” ACM
SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 371–382, 2010.
[13] J. Zhang, M. Jung, and M. Kandemir, “Fuse: Fusing stt-mram into gpus to
alleviate off-chip memory access overheads,” in 2019 IEEE International
Symposium on High Performance Computer Architecture (HPCA). IEEE,
2019, pp. 426–439.
[14] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Ke-
pler GK110, 2012.
[15] ——, NVIDIA GeForce GTX 980: Featuring Maxwell, The Most Advanced
GPU Ever Made, 2014.
[16] ——, NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator
Ever Built, 2016.
[17] ——, NVIDIA Tesla V100 GPU Architecture: The World’s Most Advanced Data Center GPU, 2017.
[18] ——, NVIDIA Turing GPU Architecture: Graphics Reinvented, 2018.
[19] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, “Warped-
compression: enabling power efficient gpus through register compression,”
in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM,
2015, pp. 502–514.
[20] S. Mittal, J. S. Vetter, and D. Li, “A survey of architectural approaches
for managing embedded dram and non-volatile on-chip caches,” IEEE
Transactions on Parallel and Distributed Systems, vol. 26, no. 6, pp. 1524–
1537, 2014.
[21] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level
performance, energy, and area model for emerging nonvolatile memory,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 31, no. 7, pp. 994–1007, 2012.
[22] S. Yazdanshenas, M. R. Pirbasti, M. Fazeli, and A. Patooghy, “Coding last
level stt-ram cache for high endurance and low power,” IEEE computer
architecture letters, vol. 13, no. 2, pp. 73–76, 2013.
[23] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt,
“Analyzing cuda workloads using a detailed gpu simulator,” in 2009
IEEE International Symposium on Performance Analysis of Systems and
Software. IEEE, 2009, pp. 163–174.
[24] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Base-delta-immediate compression: practical data
compression for on-chip caches,” in Proceedings of the 21st international
conference on Parallel architectures and compilation techniques. ACM,
2012, pp. 377–388.
[25] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari,
G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark suite for
scientific and commercial throughput computing,” Center for Reliable and
High-Performance Computing, vol. 127, 2012.
[26] NVIDIA, NVIDIA CUDA SDK Code Sample 4.0, March 2016.
[27] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron,
“A characterization of the rodinia benchmark suite with comparison to
contemporary cmp workloads,” in IEEE International Symposium on
Workload Characterization (IISWC’10). IEEE, 2010, pp. 1–11.
[28] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” URL: http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[29] S. Lee, K. Kim, G. Koo, H. Jeon, M. Annavaram, and W. W. Ro, “Improv-
ing energy efficiency of gpus through data compression and compressed
execution,” IEEE Transactions on Computers, vol. 66, no. 5, pp. 834–847,
2016.
[30] D. Wong, N. S. Kim, and M. Annavaram, “Approximating warps with
intra-warp operand value similarity,” in 2016 IEEE International Sympo-
sium on High Performance Computer Architecture (HPCA). IEEE, 2016,
pp. 176–187.
WON JEON received the B.S. degree in electrical
and electronic engineering from Yonsei Univer-
sity, Seoul, Korea, in 2014. He is currently working toward the Ph.D. degree with the Embedded Systems and Computer Architecture Laboratory, School of Electrical and Electronic Engineering, Yonsei University. His current research interests
are GPU memory systems, non-volatile memory,
processing-in-memory architecture designs, and
approximate computing for neural network appli-
cations. He is a student member of the IEEE.
JUN HYUN PARK received the B.S. degree in
electrical and electronic engineering from Chung-
Ang University, Seoul, Korea, in 2018 and the
M.S. degree in electrical and electronic engineer-
ing from Yonsei University, Seoul, Korea, in 2020.
He is currently working as an IP design engineer
with the System LSI Division, Samsung Electronics.
His current research interests are interconnection
technology in SoC and GPU architecture.
YOONSOO KIM received the B.S. and M.S. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2016 and 2018, respectively. He is currently working as a system engineer with the NAND Solution Division, SK Hynix. His current research interests are solid-state drive functionality and GPU architecture.
GUNJAE KOO received the B.S. and M.S. degrees in electrical and computer engineering from Seoul National University in 2001 and 2003, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California in 2018. He is currently an assistant professor with the Department of Computer Science and Engineering, Korea University. His research interest is in the
general area of computer system architecture and
spans parallel processor architecture, storage and
memory systems, accelerators, and secure processor architecture. Prior to
joining Korea University, he was an assistant professor at Hongik Univer-
sity. His industry experience includes a senior research engineer with LG
Electronics and a research intern with Intel. He is a member of the IEEE and
the ACM.
WON WOO RO received the B.S. degree in elec-
trical engineering from Yonsei University, Seoul,
Korea, in 1996, and the M.S. and Ph.D. degrees
in electrical engineering from the University of
Southern California, in 1999 and 2004, respec-
tively. He worked as a research scientist with
the Electrical Engineering and Computer Science
Department, University of California, Irvine. He
currently works as a professor with the School
of Electrical and Electronic Engineering, Yonsei
University. Prior to joining Yonsei University, he worked as an assistant
professor with the Department of Electrical and Computer Engineering,
California State University, Northridge. His industry experience includes a
college internship with Apple Computer, Inc. and a contract software engi-
neer with ARM, Inc. His current research interests include high-performance
microprocessor design, GPU microarchitectures, neural network accelera-
tors, and memory hierarchy design. He is a senior member of the IEEE.