Hi-End architecture.

Source publication
Article
Modern Graphics Processing Units (GPUs) require large hardware resources for massively parallel thread execution. In particular, modern GPUs have a large register file composed of Static Random Access Memory (SRAM). Due to the high leakage current of SRAM, the register file consumes approximately 20% of the total GPU energy. The energy efficiency of...

Contexts in source publication

Context 1
... this section, we describe the detailed architecture of the proposed STT-MRAM-based hierarchical, endurance-aware register file, called Hi-End. The overall architecture of Hi-End is depicted in Fig. 6. Hi-End adds components (a register cache, a delay buffer, a compression unit, and bank-level wear-leveling) to the baseline GPU register file. The newly added blocks are shown in gray in the figure. Hi-End employs a hierarchical structure that exploits an SRAM-based write cache to hide the long write latency of the STT-MRAM ...
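The write-cache idea in Context 1 can be illustrated with a toy model: writes land in a small SRAM register cache, and only evictions pay the long STT-MRAM write latency. This is a minimal sketch; the latencies, capacity, and eviction policy are illustrative assumptions, not Hi-End's actual design.

```python
# Toy model of a hierarchical register file: writes go to a small SRAM
# register cache first, and only evicted lines reach the slow STT-MRAM
# array. Latencies below are illustrative placeholders, not measured values.
SRAM_WRITE = 1    # cycles (assumed)
MRAM_WRITE = 10   # cycles (assumed)

def write(cache, capacity, reg, value):
    """Write into the SRAM cache; evict to STT-MRAM only when full."""
    cycles = SRAM_WRITE
    if reg not in cache and len(cache) >= capacity:
        cache.pop(next(iter(cache)))  # evict oldest line to STT-MRAM
        cycles += MRAM_WRITE
    cache[reg] = value
    return cycles

cache = {}
costs = [write(cache, 2, r, 0) for r in ("r0", "r1", "r0", "r2")]
print(costs)  # only the capacity miss pays the MRAM write latency
```

In this sketch only one of the four writes (the capacity miss on `r2`) incurs the MRAM latency, which is the effect the write cache is meant to achieve.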
Context 2
... GPU utilizes the operand collector to read register data from the unified register file. When a warp issues an instruction that accesses register data, the warp scheduler passes the operand information required for register file accesses (see Fig. 6). Each operand collector contains operand information such as warp IDs, valid flags, register IDs, and ready flags. Such operand information is sent to Hi-End to access register data from the hierarchical register ...
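The operand-collector bookkeeping described above can be sketched as a small data structure; the field and function names here are hypothetical, chosen only to mirror the warp ID, register ID, valid flag, and ready flag fields mentioned in the context.

```python
from dataclasses import dataclass

# Hypothetical sketch of the operand information an operand-collector
# entry might hold before it is forwarded to the hierarchical register
# file; field names are illustrative, not taken from the paper.
@dataclass
class OperandEntry:
    warp_id: int         # which warp issued the instruction
    reg_id: int          # architectural register being read
    valid: bool = True   # entry holds a live request
    ready: bool = False  # operand data has been fetched

def collect(entries):
    """Return the operands whose data is ready to be dispatched."""
    return [e for e in entries if e.valid and e.ready]

pending = [OperandEntry(0, 3, ready=True), OperandEntry(1, 7)]
print(len(collect(pending)))  # only the first request is ready
```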

Similar publications

Article
In image processing applications, the utilization of on-chip Static Random Access Memory (SRAM) in Field Programmable Gate Arrays (FPGAs) is extremely important. True dual-port (TDP) SRAM and single-port SRAM are the SRAM types typically available for image processing applications, but the memory must be redesigned to match the data access policy. Hence On-Chip m...

Citations

... In terms of the recent GPU applications of nonvolatile memory (NVM) technologies, a novel register file architecture and its management system for GPUs are proposed by Jeon et al. 13 Inci et al. put forward a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL). 14 In addition, the recently released GPU architecture of NVIDIA reinforces the importance of on-chip storage in parallel computing architectures. ...
Article
With the rapid development of portable computing devices and users’ demand for high-quality graphics rendering, embedded Graphics Processing Unit (GPU) systems for graphics processing are increasingly becoming a key component of computer architecture to enhance computability. The cache system based on traditional static random access memory (SRAM) plays a crucial role in GPUs, but its high leakage, low lifetime, and poor integration deeply plague the science and engineering field. In this paper, a novel magnetic random access memory (MRAM) based cache architecture for GPU systems is proposed for highly efficient graphics processing and computing acceleration, with the merits of high speed, long endurance, strong interference resistance, and ultra-low power consumption. Spin-transfer torque MRAM and spin-orbit torque MRAM are utilized in off-chip and on-chip caches, respectively. A controller design scheme with prefetching modules and optimized cache coherency protocols is adopted. After testing and evaluation with multiple loads, neural network models, and datasets, the simulation results show that the proposed system can achieve improvements of up to 28%, 56%, and 66.45% in speed, energy, and leakage power, respectively.
... While it is unlikely that a computing system will reach this perfect efficiency anytime in the near future, the conventional approaches in use are still multiple orders of magnitude away from the limit. According to Landauer's work, the theoretical minimum energy that can be consumed when a bit is flipped is 2.6 × 10⁻²¹ J [20], but even current state-of-the-art computing systems require something on the order of 1.0 × 10⁻¹³ J [21]. ...
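The quoted figures can be checked directly from Landauer's bound, E_min = k_B · T · ln 2. At 300 K it evaluates to roughly 2.87 × 10⁻²¹ J (the 2.6 × 10⁻²¹ J cited in [20] is consistent with a slightly lower assumed temperature), and dividing the ~1.0 × 10⁻¹³ J per bit-flip of current systems by the bound exposes the multi-order-of-magnitude gap:

```python
import math

# Landauer's bound at room temperature: E_min = k_B * T * ln(2).
k_B = 1.380649e-23          # Boltzmann constant, J/K
T = 300.0                   # room temperature, K
e_min = k_B * T * math.log(2)
print(f"{e_min:.2e}")       # ~2.87e-21 J, same order as the 2.6e-21 J quoted

# Gap between a current ~1e-13 J/bit-flip system and the bound:
print(f"{1.0e-13 / e_min:.1e}")  # roughly 7 orders of magnitude
```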
Article
There are many real-world applications that require high-performance mobile computing systems for onboard, real-time processing of gathered data due to latency, reliability, security, or other application constraints. Unfortunately, most existing high-performance mobile computing systems require a prohibitively high power consumption in the face of the limited power available from the batteries typically used in these applications. For high-performance mobile computing to be practical, alternative hardware designs are needed to increase the computing performance while minimizing the required power consumption. This article surveys the state-of-the-art in high-efficiency, high-performance onboard mobile computing, focusing on the latest developments. It was found that more research is needed to design high-performance mobile computing systems while minimizing the required power consumption to meet the needs of these applications.
... components, consuming around 15-20% of total GPU energy [9]–[13]. Over the past decade, researchers have explored several architectural techniques to minimize the energy consumption of the GPU register file [9]–[11], [14]–[18]. One attractive solution to minimize the energy consumption of the register file is to adopt emerging non-volatile memories (NVMs), such as spin-transfer torque magnetoresistive random access memory (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) [11], [14], [16], as a substitute for the static random-access memory (SRAM) of an existing register file. ...
... Over the past decade, researchers have explored several architectural techniques to minimize the energy consumption of the GPU register file [9]–[11], [14]–[18]. One attractive solution to minimize the energy consumption of the register file is to adopt emerging non-volatile memories (NVMs), such as spin-transfer torque magnetoresistive random access memory (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) [11], [14], [16], as a substitute for the static random-access memory (SRAM) of an existing register file. Leveraging their low leakage power, implementing the register file using NVMs significantly reduces the leakage energy consumption. ...
... Even with its lower leakage power, NVM cannot solely be used as the register file due to its longer access latency compared to SRAM, as shown in TABLE 1. To overcome the latency problem while taking advantage of the low leakage power, researchers have proposed hybrid or hierarchical register files composed of SRAM and NVM [11], [14], [19]. In the hybrid register file [14], SRAM write buffers are inserted ahead of each STT-MRAM-based register bank. ...
Article
Graphics processing units (GPUs) achieve high throughput by exploiting a high degree of thread-level parallelism (TLP). To support such high TLP, GPUs have a large register file to store the context of all threads, consuming around 20% of total GPU energy. Several previous studies have attempted to minimize the energy consumption of the register file by implementing an emerging non-volatile memory (NVM), leveraging its higher density and lower leakage power over SRAMs. To amortize the cost of the long access latency of NVM, prior work adopts a hierarchical register file consisting of an SRAM-based register cache and NVM-based registers, where the register cache works as a write buffer. To get the register cache index, they use partially selected bits of the warp ID and register ID. This work observes that such an index calculation causes three types of contentions leading to the underutilization of the register cache: inter-warp , intra-warp , and false contentions. To minimize such contentions, this paper proposes a thread context-aware register cache (TEA-RC) in GPUs. In TEA-RC, the cache index is calculated considering the high correlation between the number of scheduled threads and the register usage of threads. The proposed design shows 28.5% higher performance and 9.1 percentage points lower energy consumption over the conventional register cache that concatenates three bits of warp ID and five bits of register ID to compute the cache index.
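The index calculation that the abstract critiques, concatenating three bits of warp ID with five bits of register ID, can be reproduced in a few lines to show how inter-warp contention arises. The bit widths follow the abstract; everything else is illustrative.

```python
# Conventional register-cache index as described in the abstract:
# concatenate 3 low bits of the warp ID with 5 low bits of the register ID.
def rc_index(warp_id: int, reg_id: int) -> int:
    return ((warp_id & 0b111) << 5) | (reg_id & 0b11111)

# Inter-warp contention: warps 2 and 10 share the same low 3 bits of their
# IDs, so the same register of both warps maps to the same cache index.
assert rc_index(2, 4) == rc_index(10, 4)
print(rc_index(2, 4))  # -> 68
```

TEA-RC's point is that a smarter index function, informed by how many threads are actually scheduled and how many registers they use, avoids folding distinct warps onto the same cache entry like this.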
... The register file is one of the most energy-consuming memory structures in a GPU, responsible for approximately 20% of the device's total energy consumption [9], and its consumption grows generation after generation. For example, the register file of the NVIDIA Tesla V100, at 20 MB, is 5 times larger than its counterpart in the Tesla K40 [19]. ...
Conference Paper
The ever-increasing parallelism demand of general-purpose applications on graphics processing units (GPUs) pushes toward larger and more energy-hungry register files in successive generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of the register file. However, operating at such low voltages compromises the reliability of the circuit. This work aims to tolerate permanent faults due to process variations in a GPU register file operating below the safe voltage limit. To that end, this paper proposes a microarchitectural register-redirection technique, RRCD, which exploits the data redundancy inherent in applications to compress registers at run time with neither compiler assistance nor modifications to the instruction set. Instead of disabling an entire faulty register entry, RRCD leverages the reliable cells within a faulty entry to redirect and store compressed registers. Experimental results show that, with more than a third of faulty register entries, RRCD ensures the reliability of the register file and reduces energy consumption by 47% with respect to a conventional design powered at nominal voltage. The energy savings are 21% compared to a voltage-noise smoothing scheme operating at the safe voltage limit. These benefits are obtained with an impact on system performance and area of less than 2% and 6%, respectively.
... The register file is one of the most energy-consuming memory structures in a GPU, responsible for approximately 20% of the device's total energy consumption [9], and its consumption grows generation after generation. For example, the register file of the NVIDIA Tesla V100, at 20 MB, is 5 times larger than its counterpart in the Tesla K40 [19]. ...
Preprint
The ever-increasing parallelism demand of general-purpose applications on graphics processing units (GPUs) pushes toward larger and more energy-hungry register files in successive generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of the register file. However, operating at such low voltages compromises the reliability of the circuit. This work aims to tolerate permanent faults due to process variations in a GPU register file operating below the safe voltage limit. To that end, this paper proposes a microarchitectural register-redirection technique, RRCD, which exploits the data redundancy inherent in applications to compress registers at run time with neither compiler assistance nor modifications to the instruction set. Instead of disabling an entire faulty register entry, RRCD leverages the reliable cells within a faulty entry to redirect and store compressed registers. Experimental results show that, with more than a third of faulty register entries, RRCD ensures the reliability of the register file and reduces energy consumption by 47% with respect to a conventional design powered at nominal voltage. The energy savings are 21% compared to a voltage-noise smoothing scheme operating at the safe voltage limit. These benefits are obtained with an impact on system performance and area of less than 2% and 6%, respectively.
... Also, they adopted a write buffer scheme to hide the long write latency of MRAM, which is widely used in many other previous works on MRAM-based on-chip caches [41]. In addition to MRAM-based large caches, several researchers proposed MRAM-based register files [17] and L1 caches [20], [36], which also avoid the long MRAM write latency by using small write buffers. ...
Article
Monolithic 3D (M3D) integration has emerged as a promising technology for fine-grained 3D stacking. As M3D integration offers extremely small via dimensions at the nanometer scale, it is beneficial for small microarchitectural blocks such as caches, register files, translation look-aside buffers (TLBs), etc. However, since M3D integration requires a low-temperature process for the stacked layers, it leads to lower performance for stacked transistors compared to the conventional 2D process. In contrast, non-volatile memory (NVM) such as magnetic RAM (MRAM) is originally fabricated at a low temperature, which enables M3D integration without performance degradation. In this paper, we propose an energy-efficient unified L2 TLB-cache architecture exploiting M3D-based SRAM/MRAM hybrid memory. Since the M3D-based SRAM/MRAM hybrid memory consumes much less energy than the conventional 2D SRAM-only memory and the 2D SRAM/MRAM hybrid memory, while providing comparable performance, our proposed architecture significantly improves energy efficiency. In particular, as our proposed architecture changes the memory partitioning of the unified L2 TLB-cache depending on the L2 cache miss rate, it maximizes the energy efficiency for parallel workloads suffering extremely high L2 cache miss rates. According to our analysis using PARSEC benchmark applications, our proposed architecture reduces the energy consumption of the L2 TLB + L2 cache by up to 97.7% (53.6% on average), compared to the baseline with 2D SRAM-only memory, with negligible impact on performance. Furthermore, our proposed technique reduces the memory access energy consumption by up to 32.8% (10.9% on average) by reducing memory accesses due to TLB misses.
... 20% of the total GPU energy [22], and grows generation after generation. For instance, the register file of the NVIDIA Tesla V100, with 20 MB, is more than 5× larger than its counterpart in the Tesla K40 [35]. ...
Article
The ever-increasing parallelism demand of General-Purpose Graphics Processing Unit (GPGPU) applications pushes toward larger and more energy-hungry register files in successive GPU generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of register files. However, at these operating voltages, the reliability of the circuit is compromised. This work aims to tolerate permanent faults from process variations in large GPU register files operating below the safe supply voltage limit. To do so, this paper proposes a microarchitectural patching technique, DC-Patch, exploiting the inherent data redundancy of applications to compress registers at run-time with neither compiler assistance nor instruction set modifications. Instead of disabling an entire faulty register file entry, DC-Patch leverages the reliable cells within a faulty entry to store compressed register values. Experimental results show that, with more than a third of faulty register entries, DC-Patch ensures a reliable operation of the register file and reduces the energy consumption by 47% with respect to a conventional register file working at nominal supply voltage. The energy savings are 21% compared to a voltage noise smoothing scheme operating at the safe supply voltage limit. These benefits are obtained with less than 2 and 6% impact on the system performance and area, respectively.
Article
Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) is an emerging non-volatile memory technology that has received significant attention due to its higher density and lower leakage current over SRAM. One compelling use case is to employ STT-MRAM as a GPU Register File (RF) to reduce its massive energy consumption. One critical challenge is that STT-MRAM has longer access latency and higher dynamic power consumption than SRAM, which motivates the hierarchical RF that places a small SRAM-based Register Cache (RC) between the functional units and the STT-MRAM RF. The RC acts as a write buffer, so all writes to the RF are first performed on the RC. In the presence of a conflict miss, the RC writes back the corresponding cache line into the RF. In this work, we observe that a large number of such write-back operations are unnecessary because they include register values that are never used again. Leveraging this observation, we propose a Compiler-ASsisted Hierarchical Register File in GPUs (CASH-RF) that optimizes STT-MRAM accesses by removing dead register values. In CASH-RF, unnecessary write-back operations are detected by the compiler via last-consumer analysis. At runtime, the corresponding RC lines are discarded after the last references without being updated to the RF. Compared to the baseline GPUs, CASH-RF removes 59.5% of write-back operations, which leads to 54.7% lower RF energy consumption with only 2.6% performance degradation.
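The write-back elimination idea behind CASH-RF can be sketched as a simple eviction decision: if the compiler has marked a register's last consumer as already executed, the dirty register-cache line is discarded instead of written back. The data structures below are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch of the CASH-RF eviction decision: a dirty register-cache line
# whose register has passed its compiler-marked last read is dead, so
# the costly STT-MRAM write-back can be skipped.
def evict(line, last_use_done):
    """Return 'writeback' or 'discard' for an evicted dirty RC line."""
    if line["dirty"] and not last_use_done[line["reg"]]:
        return "writeback"      # value may still be read again
    return "discard"            # dead value: skip the MRAM write

# last_use_done is assumed to be filled from compiler metadata at runtime.
last_use_done = {"r1": True, "r2": False}
print(evict({"reg": "r1", "dirty": True}, last_use_done))  # discard
print(evict({"reg": "r2", "dirty": True}, last_use_done))  # writeback
```

In the paper's evaluation this filter removes 59.5% of write-backs; the sketch only shows where that decision plugs into the eviction path.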