Hi-End architecture.

Source publication
Article
Modern Graphics Processing Units (GPUs) require large hardware resources for massively parallel thread execution. In particular, modern GPUs have a large register file composed of Static Random Access Memory (SRAM). Due to the high leakage current of SRAM, the register file consumes approximately 20% of the total GPU energy. The energy efficiency of...

Contexts in source publication

Context 1
... this section, we describe the detailed architecture of the proposed STT-MRAM-based hierarchical, endurance-aware register file, called Hi-End. The overall architecture of Hi-End is depicted in Fig. 6. Hi-End adds components (a register cache, a delay buffer, a compression unit, and bank-level wear-leveling) to the baseline GPU register file. The newly added blocks are shown in gray in the figure. Hi-End employs a hierarchical structure that exploits an SRAM-based write cache to hide the long write latency of the STT-MRAM ...
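The write-cache idea in Context 1 can be illustrated with a toy model: writes land in a small SRAM register cache, and only evictions pay the long STT-MRAM write latency. This is a minimal sketch; the latencies, capacity, and eviction policy are illustrative assumptions, not Hi-End's actual design.

```python
# Toy model of a hierarchical register file: writes go to a small SRAM
# register cache first, and only evicted lines reach the slow STT-MRAM
# array. Latencies below are illustrative placeholders, not measured values.
SRAM_WRITE = 1    # cycles (assumed)
MRAM_WRITE = 10   # cycles (assumed)

def write(cache, capacity, reg, value):
    """Write into the SRAM cache; evict to STT-MRAM only when full."""
    cycles = SRAM_WRITE
    if reg not in cache and len(cache) >= capacity:
        cache.pop(next(iter(cache)))  # evict oldest line to STT-MRAM
        cycles += MRAM_WRITE
    cache[reg] = value
    return cycles

cache = {}
costs = [write(cache, 2, r, 0) for r in ("r0", "r1", "r0", "r2")]
print(costs)  # only the capacity miss pays the MRAM write latency
```

In this sketch only one of the four writes (the capacity miss on `r2`) incurs the MRAM latency, which is the effect the write cache is meant to achieve.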
Context 2
... GPU utilizes the operand collector to read register data from the unified register file. When a warp issues an instruction that accesses register data, the warp scheduler passes the operand information required for register file accesses (see Fig. 6). Each operand collector contains operand information such as warp IDs, valid flags, register IDs, and ready flags. Such operand information is sent to Hi-End to access register data from the hierarchical register ...
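The operand-collector bookkeeping described above can be sketched as a small data structure; the field and function names here are hypothetical, chosen only to mirror the warp ID, register ID, valid flag, and ready flag fields mentioned in the context.

```python
from dataclasses import dataclass

# Hypothetical sketch of the operand information an operand-collector
# entry might hold before it is forwarded to the hierarchical register
# file; field names are illustrative, not taken from the paper.
@dataclass
class OperandEntry:
    warp_id: int         # which warp issued the instruction
    reg_id: int          # architectural register being read
    valid: bool = True   # entry holds a live request
    ready: bool = False  # operand data has been fetched

def collect(entries):
    """Return the operands whose data is ready to be dispatched."""
    return [e for e in entries if e.valid and e.ready]

pending = [OperandEntry(0, 3, ready=True), OperandEntry(1, 7)]
print(len(collect(pending)))  # only the first request is ready
```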

Similar publications

Article
In image processing applications, the utilization of on-chip Static Random Access Memory (SRAM) in Field Programmable Gate Arrays (FPGAs) is extremely important. True dual-port (TDP) SRAM and single-port SRAM are the SRAM types typically available for image processing applications, but the memory must be redesigned to match the data access policy. Hence On-Chip m...

Citations

... In terms of the recent GPU applications of nonvolatile memory (NVM) technologies, a novel register file architecture and its management system for GPUs are proposed by Jeon et al. 13 Inci et al. put forward a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL). 14 In addition, the recently released GPU architecture of NVIDIA reinforces the importance of on-chip storage in parallel computing architectures. ...
Article
With the rapid development of portable computing devices and users’ demand for high-quality graphics rendering, embedded Graphics Processing Unit (GPU) systems for graphics processing are increasingly becoming a key component of computer architecture to enhance computability. The cache system based on traditional static random access memory (SRAM) plays a crucial role in GPUs, but its high leakage, low lifetime, and poor integration deeply plague the science and engineering field. In this paper, a novel magnetic random access memory (MRAM) based cache architecture for GPU systems is proposed for highly efficient graphics processing and computing acceleration, with the merits of high speed, long endurance, strong interference resistance, and ultra-low power consumption. Spin-transfer torque MRAM and spin-orbit torque MRAM are utilized in off-chip and on-chip caches, respectively. A controller design scheme with prefetching modules and optimized cache coherency protocols is adopted. After testing and evaluation with multiple loads, neural network models, and datasets, the simulation results show that the proposed system can achieve improvements of up to 28%, 56%, and 66.45% in speed, energy, and leakage power, respectively.
... While it is unlikely that a computing system will reach this perfect efficiency anytime in the near future, the conventional approaches in use are still multiple orders of magnitude away from the limit. According to Landauer's work, the theoretical minimum energy that can be consumed when a bit is flipped is 2.6 × 10⁻²¹ J [20], but even current state-of-the-art computing systems require something on the order of 1.0 × 10⁻¹³ J [21]. ...
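The quoted figures can be checked directly from Landauer's bound, E_min = k_B · T · ln 2. At 300 K it evaluates to roughly 2.87 × 10⁻²¹ J (the 2.6 × 10⁻²¹ J cited in [20] is consistent with a slightly lower assumed temperature), and dividing the ~1.0 × 10⁻¹³ J per bit-flip of current systems by the bound exposes the multi-order-of-magnitude gap:

```python
import math

# Landauer's bound at room temperature: E_min = k_B * T * ln(2).
k_B = 1.380649e-23          # Boltzmann constant, J/K
T = 300.0                   # room temperature, K
e_min = k_B * T * math.log(2)
print(f"{e_min:.2e}")       # ~2.87e-21 J, same order as the 2.6e-21 J quoted

# Gap between a current ~1e-13 J/bit-flip system and the bound:
print(f"{1.0e-13 / e_min:.1e}")  # roughly 7 orders of magnitude
```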
Article
There are many real-world applications that require high-performance mobile computing systems for onboard, real-time processing of gathered data due to latency, reliability, security, or other application constraints. Unfortunately, most existing high-performance mobile computing systems require a prohibitively high power consumption in the face of the limited power available from the batteries typically used in these applications. For high-performance mobile computing to be practical, alternative hardware designs are needed to increase the computing performance while minimizing the required power consumption. This article surveys the state-of-the-art in high-efficiency, high-performance onboard mobile computing, focusing on the latest developments. It was found that more research is needed to design high-performance mobile computing systems while minimizing the required power consumption to meet the needs of these applications.
... components, consuming around 15-20% of total GPU energy [9]–[13]. Over the past decade, researchers have explored several architectural techniques to minimize the energy consumption of the GPU register file [9]–[11], [14]–[18]. One attractive solution to minimize the energy consumption of the register file is to adopt emerging non-volatile memories (NVMs), such as spin-transfer torque magnetoresistive random access memory (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) [11], [14], [16], as a substitute for the static random-access memory (SRAM) of an existing register file. ...
... Over the past decade, researchers have explored several architectural techniques to minimize the energy consumption of the GPU register file [9]–[11], [14]–[18]. One attractive solution to minimize the energy consumption of the register file is to adopt emerging non-volatile memories (NVMs), such as spin-transfer torque magnetoresistive random access memory (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) [11], [14], [16], as a substitute for the static random-access memory (SRAM) of an existing register file. Leveraging their low leakage power, implementing the register file using NVMs significantly reduces the leakage energy consumption. ...
... Even with its lower leakage power, NVM cannot solely be used as the register file due to its longer access latency compared to SRAM, as shown in TABLE 1. To overcome the latency problem while taking advantage of the low leakage power, researchers have proposed hybrid or hierarchical register files composed of SRAM and NVM [11], [14], [19]. In the hybrid register file [14], SRAM write buffers are inserted ahead of each STT-MRAM-based register bank. ...
Article
Graphics processing units (GPUs) achieve high throughput by exploiting a high degree of thread-level parallelism (TLP). To support such high TLP, GPUs have a large register file to store the context of all threads, consuming around 20% of total GPU energy. Several previous studies have attempted to minimize the energy consumption of the register file by implementing an emerging non-volatile memory (NVM), leveraging its higher density and lower leakage power over SRAMs. To amortize the cost of the long access latency of NVM, prior work adopts a hierarchical register file consisting of an SRAM-based register cache and NVM-based registers, where the register cache works as a write buffer. To get the register cache index, they use partially selected bits of the warp ID and register ID. This work observes that such an index calculation causes three types of contentions leading to the underutilization of the register cache: inter-warp , intra-warp , and false contentions. To minimize such contentions, this paper proposes a thread context-aware register cache (TEA-RC) in GPUs. In TEA-RC, the cache index is calculated considering the high correlation between the number of scheduled threads and the register usage of threads. The proposed design shows 28.5% higher performance and 9.1 percentage points lower energy consumption over the conventional register cache that concatenates three bits of warp ID and five bits of register ID to compute the cache index.
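The index calculation that the abstract critiques, concatenating three bits of warp ID with five bits of register ID, can be reproduced in a few lines to show how inter-warp contention arises. The bit widths follow the abstract; everything else is illustrative.

```python
# Conventional register-cache index as described in the abstract:
# concatenate 3 low bits of the warp ID with 5 low bits of the register ID.
def rc_index(warp_id: int, reg_id: int) -> int:
    return ((warp_id & 0b111) << 5) | (reg_id & 0b11111)

# Inter-warp contention: warps 2 and 10 share the same low 3 bits of their
# IDs, so the same register of both warps maps to the same cache index.
assert rc_index(2, 4) == rc_index(10, 4)
print(rc_index(2, 4))  # -> 68
```

TEA-RC's point is that a smarter index function, informed by how many threads are actually scheduled and how many registers they use, avoids folding distinct warps onto the same cache entry like this.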
... The register file is one of the most energy-consuming memory structures in a GPU, responsible for approximately 20% of the device's total energy consumption [9], and its consumption grows generation after generation. For example, the register file of the NVIDIA Tesla V100, at 20 MB, is 5 times larger than its counterpart in the Tesla K40 [19]. ...
Conference Paper
The ever-increasing parallelism demand of general-purpose applications on graphics processing units (GPUs) pushes toward larger and more energy-hungry register files in successive generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of the register file. However, operating at such low voltages compromises the reliability of the circuit. This work aims to tolerate permanent faults due to process variations in a GPU register file operating below the safe voltage limit. To that end, this paper proposes a microarchitectural register-redirection technique, RRCD, which exploits the data redundancy inherent in applications to compress registers at run time with neither compiler assistance nor modifications to the instruction set. Instead of disabling an entire faulty register entry, RRCD leverages the reliable cells within a faulty entry to redirect and store compressed registers. Experimental results show that, with more than a third of faulty register entries, RRCD ensures the reliability of the register file and reduces energy consumption by 47% with respect to a conventional design powered at nominal voltage. The energy savings are 21% compared to a voltage-noise smoothing scheme operating at the safe voltage limit. These benefits are obtained with an impact on system performance and area of less than 2% and 6%, respectively.
... The register file is one of the most energy-consuming memory structures in a GPU, responsible for approximately 20% of the device's total energy consumption [9], and its consumption grows generation after generation. For example, the register file of the NVIDIA Tesla V100, at 20 MB, is 5 times larger than its counterpart in the Tesla K40 [19]. ...
Preprint
The ever-increasing parallelism demand of general-purpose applications on graphics processing units (GPUs) pushes toward larger and more energy-hungry register files in successive generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of the register file. However, operating at such low voltages compromises the reliability of the circuit. This work aims to tolerate permanent faults due to process variations in a GPU register file operating below the safe voltage limit. To that end, this paper proposes a microarchitectural register-redirection technique, RRCD, which exploits the data redundancy inherent in applications to compress registers at run time with neither compiler assistance nor modifications to the instruction set. Instead of disabling an entire faulty register entry, RRCD leverages the reliable cells within a faulty entry to redirect and store compressed registers. Experimental results show that, with more than a third of faulty register entries, RRCD ensures the reliability of the register file and reduces energy consumption by 47% with respect to a conventional design powered at nominal voltage. The energy savings are 21% compared to a voltage-noise smoothing scheme operating at the safe voltage limit. These benefits are obtained with an impact on system performance and area of less than 2% and 6%, respectively.
... Also, they adopted a write buffer scheme to hide the long write latency of MRAM, which is widely used in many other previous works on MRAM-based on-chip caches [41]. In addition to MRAM-based large caches, several researchers proposed MRAM-based register files [17] and L1 caches [20], [36], which also avoid the long MRAM write latency by using small write buffers. ...
Article
Monolithic 3D (M3D) integration has emerged as a promising technology for fine-grained 3D stacking. As M3D integration offers extremely small via dimensions at the nanometer scale, it is beneficial for small microarchitectural blocks such as caches, register files, translation look-aside buffers (TLBs), etc. However, since M3D integration requires a low-temperature process for the stacked layers, it leads to lower performance for stacked transistors compared to the conventional 2D process. In contrast, non-volatile memory (NVM) such as magnetic RAM (MRAM) is originally fabricated at a low temperature, which enables M3D integration without performance degradation. In this paper, we propose an energy-efficient unified L2 TLB-cache architecture exploiting M3D-based SRAM/MRAM hybrid memory. Since the M3D-based SRAM/MRAM hybrid memory consumes much less energy than the conventional 2D SRAM-only memory and the 2D SRAM/MRAM hybrid memory, while providing comparable performance, our proposed architecture significantly improves energy efficiency. In particular, as our proposed architecture changes the memory partitioning of the unified L2 TLB-cache depending on the L2 cache miss rate, it maximizes the energy efficiency for parallel workloads suffering extremely high L2 cache miss rates. According to our analysis using PARSEC benchmark applications, our proposed architecture reduces the energy consumption of the L2 TLB + L2 cache by up to 97.7% (53.6% on average), compared to the baseline with 2D SRAM-only memory, with negligible impact on performance. Furthermore, our proposed technique reduces the memory access energy consumption by up to 32.8% (10.9% on average) by reducing memory accesses due to TLB misses.
... 20% of the total GPU energy [22], and grows generation after generation. For instance, the register file of the NVIDIA Tesla V100, with 20 MB, is more than 5× larger than its counterpart in the Tesla K40 [35]. ...
Article
The ever-increasing parallelism demand of General-Purpose Graphics Processing Unit (GPGPU) applications pushes toward larger and more energy-hungry register files in successive GPU generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of register files. However, at these operating voltages, the reliability of the circuit is compromised. This work aims to tolerate permanent faults from process variations in large GPU register files operating below the safe supply voltage limit. To do so, this paper proposes a microarchitectural patching technique, DC-Patch, exploiting the inherent data redundancy of applications to compress registers at run-time with neither compiler assistance nor instruction set modifications. Instead of disabling an entire faulty register file entry, DC-Patch leverages the reliable cells within a faulty entry to store compressed register values. Experimental results show that, with more than a third of faulty register entries, DC-Patch ensures a reliable operation of the register file and reduces the energy consumption by 47% with respect to a conventional register file working at nominal supply voltage. The energy savings are 21% compared to a voltage noise smoothing scheme operating at the safe supply voltage limit. These benefits are obtained with less than 2 and 6% impact on the system performance and area, respectively.
Article
Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) is an emerging non-volatile memory technology that has received significant attention due to its higher density and lower leakage current over SRAM. One compelling use case is to employ STT-MRAM as a GPU Register File (RF) to reduce its massive energy consumption. One critical challenge is that STT-MRAM has longer access latency and higher dynamic power consumption than SRAM, which motivates the hierarchical RF that places a small SRAM-based Register Cache (RC) between the functional units and the STT-MRAM RF. The RC acts as a write buffer, so all writes to the RF are first performed on the RC. In the presence of a conflict miss, the RC writes back the corresponding cache line into the RF. In this work, we observe that a large number of such write-back operations are unnecessary because they include register values that are never used again. Leveraging this observation, we propose a Compiler-ASsisted Hierarchical Register File in GPUs (CASH-RF) that optimizes STT-MRAM accesses by removing dead register values. In CASH-RF, unnecessary write-back operations are detected by the compiler via last-consumer analysis. At runtime, the corresponding RC lines are discarded after the last references without being updated to the RF. Compared to the baseline GPUs, CASH-RF removes 59.5% of write-back operations, which leads to 54.7% lower RF energy consumption with only 2.6% performance degradation.
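The write-back elimination idea behind CASH-RF can be sketched as a simple eviction decision: if the compiler has marked a register's last consumer as already executed, the dirty register-cache line is discarded instead of written back. The data structures below are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch of the CASH-RF eviction decision: a dirty register-cache line
# whose register has passed its compiler-marked last read is dead, so
# the costly STT-MRAM write-back can be skipped.
def evict(line, last_use_done):
    """Return 'writeback' or 'discard' for an evicted dirty RC line."""
    if line["dirty"] and not last_use_done[line["reg"]]:
        return "writeback"      # value may still be read again
    return "discard"            # dead value: skip the MRAM write

# last_use_done is assumed to be filled from compiler metadata at runtime.
last_use_done = {"r1": True, "r2": False}
print(evict({"reg": "r1", "dirty": True}, last_use_done))  # discard
print(evict({"reg": "r2", "dirty": True}, last_use_done))  # writeback
```

In the paper's evaluation this filter removes 59.5% of write-backs; the sketch only shows where that decision plugs into the eviction path.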