Yunho Oh
Yonsei University · School of Electrical and Electronic Engineering

About

Publications

1,046

Reads

209

Citations

Yunho Oh currently works at the School of Electrical and Electronic Engineering, Yonsei University. Yunho does research in Electrical Engineering, Computer Engineering and Electronic Engineering. Their most recent publication is 'Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs'.

Skills and Expertise

Hardware Design

Embedded Systems

Reconfigurable Computing

Embedded Computing

Publications

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

Article

Jun 2016

Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. The latency takes hundreds of cycles which is difficult to be hidden by simply interleaving tens of warp execution. While cache hierarchy helps to reduce memory system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache content...

GPU-Friendly Parallel Genome Matching with Tiled Access and Reduced State Transition Table

Article

Aug 2012

In this paper, we propose a new parallel genome matching algorithm using graphics processing units (GPUs). Our proposed approach is based on the Aho–Corasick algorithm and it was developed based on a consideration of the architectural features of existing GPUs with a hundred or more cores. Thus, we provide an appropriate task partitioning method th...

DRAW: Investigating benefits of adaptive fetch group size on GPU

Article

Apr 2015

Previously, hiding operation stalls is one of the important issues to suppress performance degradation of Graphics Processing Units (GPUs). In this paper, we first conduct a detailed study of factors affecting the operation stalls in terms of the fetch group size on the warp scheduler. Throughout this paper, we find that the size of fetch group is...

Hardware implementation of a tessellation accelerator for the OpenVG standard

Article

Mar 2010

The OpenVG standard has been introduced as an efficient vector graphics API for embedded systems. There have been several OpenVG implementations that are based on the software rendering of image. However, the software rendering needs more execution time and power consumption than hardware accelerated rendering. For the efficient hardware implementa...

Types of max-scaled numerical encodings.

Details of the sparse fine-tuning experiments

OPT-125M and LLaMA-2-7B perplexities on WikiText2 for the combination...

Comparison of evaluation cross-entropy loss with estimated...

Effective Interplay between Sparsity and Quantization: From Theory to Practice

Preprint

Full-text available

May 2024

The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy....

Conflict-aware compiler for hierarchical register file on GPUs

Article

Feb 2024

Adaptive Kernel Merge and Fusion for Multi-Tenant Inference in Embedded GPUs

Article

Jan 2024

This paper proposes a new scheme that improves throughput and reduces queuing delay while running multiple inferences in embedded GPU-based systems. We observe that an embedded system runs inference with a fixed number of deep learning models and that inference requests often use the same model. Unlike prior work that proposed kernel fusion or sche...

SAVector: Vectored Systolic Arrays

Article

Full-text available

Jan 2024

Conventional DNN inference accelerators are designed with a few (up to four) large systolic arrays. As such a scale-up architecture often suffers from low utilization, a scale-out architecture, in which a single accelerator has tens of pods and each pod has a small systolic array, has been proposed. While the scale-out architecture is promising, it...

MAD MAcce: Supporting Multiply-Add Operations for Democratizing Matrix-Multiplication Accelerators

Conference Paper

Dec 2023

Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors

Conference Paper

Sep 2023

Performance Analysis of Neural Processing Units with Emerging Memory Technologies

Article

Jul 2023

R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs

Conference Paper

Jun 2023

SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs

Conference Paper

Feb 2023

CASH-RF: A Compiler-Assisted Hierarchical Register File in GPUs

Article

Dec 2022

Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) is an emerging non-volatile memory technology that has been received significant attention due to its higher density and lower leakage current over SRAM. One compelling use case is to employ STT-MRAM as a GPU Register File (RF) to reduce its massive energy consumption. One critical chall...

Fig. 9: Basic operation performance (Queries per Minute (QpM),...

Fig. 10: Database performance (tpmC in NewOrder and Payment. QpM in...

FLIXR: Embedding Index into Flash Translation Layer in SSDs

Article

Full-text available

Jan 2022

Flash memory technologies rely on the flash translation layer (FTL) to manage no in-place update and garbage collection. Current FTL management schemes do not exploit the semantics of the accessed data. In this paper, we explore how semantic knowledge can be exploited to build and maintain indexes for stored data automatically. Data indexing is a c...

FIGURE 2. GPU architecture with hierarchical register file.

FIGURE 3. Index calculation steps of conventional SRAM-based...

FIGURE 4. Normalized IPC for seven different configurations: Baseline,...

FIGURE 5. Normalized register write operations (lower is better) for...

TEA-RC: Thread Context-Aware Register Cache for GPUs

Article

Full-text available

Jan 2022

Graphics processing units (GPUs) achieve high throughput by exploiting a high degree of thread-level parallelism (TLP). To support such high TLP, GPUs have a large-sized register file to store the context of all threads, consuming around 20% of total GPU energy. Several previous studies have attempted to minimize the energy consumption of the regis...

FIGURE 6. L1 cache hit rates of individual loads.

FIGURE 10. Fraction of issued instructions in the sparse-format kernel.

Analyzing GCN Aggregation on GPU

Article

Full-text available

Jan 2022

Graph convolutional neural networks (GCNs) are emerging neural networks for graph structures that include large features associated with each vertex. The operations of GCN can be divided into two phases - aggregation and combination. While the combination just performs matrix multiplications using trained weights and aggregated features, the aggreg...

FIGURE 2. Coalescer and the number of memory requests.

FIGURE 6. Secure bit generation for identifying secure loads.

FIGURE 7. Modified load/store unit for GhostLeg.

GhostLeg: Selective Memory Coalescing for Secure GPU Architecture

Article

Full-text available

Jan 2022

Architectural considerations for secure executions are getting more critical for GPU since popular security applications and libraries have been ported to a GPU domain to rely on GPU’s massively parallel computations. Recent studies disclosed the security attack models that exploit GPU’s architectural vulnerabilities to leak the secret keys of AES....

Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores

Conference Paper

Oct 2020

Linebacker: preserving victim cache lines in idle register files of GPUs

Conference Paper

Jun 2019

Modern GPUs suffer from cache contention due to the limited cache size that is shared across tens of concurrently running warps. To increase the per-warp cache size prior techniques proposed warp throttling which limits the number of active warps. Warp throttling leaves several registers to be dynamically unused whenever a warp is throttled. Given...

Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs

Article

Oct 2018

This paper proposes a new architecture, called Adaptive PREfetching and Scheduling (APRES), which improves cache efficiency of GPUs. APRES relies on the observation that GPU loads tend to have either high locality or strided access patterns across warps. APRES schedules warps so that as many cache hits are generated as possible before the generatio...

FineReg: Fine-Grained Register File Management for Augmenting GPU Throughput

Conference Paper

Oct 2018

WASP: Selective Data Prefetching with Monitoring Runtime Warp Progress on GPUs

Article

Mar 2018

This paper proposes a new data prefetching technique for Graphics Processing Units (GPUs) called Warp Aware Selective Prefetching (WASP). The main idea of WASP is to dynamically select warps whose progress is slower than that of the current warp as prefetching target warps. Under the in-order instruction execution model of GPUs, these prefetching t...

Access Pattern-Aware Cache Management for Improving Data Utilization in GPU

Conference Paper

Jun 2017

Long latency of memory operation is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling which reduces th...

Access Pattern-Aware Cache Management for Improving Data Utilization in GPU

Article

Jun 2017

Dynamic Resizing on Active Warps Scheduler to Hide Operation Stalls on GPUs

Article

May 2017

This paper conducts a detailed study of the factors affecting the operation stalls in terms of the fetch group size on the warp scheduler of GPUs. Throughout this paper, we reveal that the size of a fetch group is highly involved for hiding various types of operation stalls: short latency stalls, long latency stalls, and Load/Store Unit (LSU) stall...

Network

Norman P. Jouppi
Google Inc.
Ankit Sethia
University of Michigan
Li-Wen Chang
Microsoft
Jiayuan Meng
Argonne National Laboratory
Hyesoon Kim
Georgia Institute of Technology

Viswanadha raju SOmalaraju
Jawaharlal Nehru Technological University, Hyderabad
Hamid Sarbazi-Azad
Institute for Research in Fundamental Sciences (IPM)
Juan Gómez-Luna
University of Cordoba (Spain)
Sheng Ma
National University of Defense Technology
Nam Kim
University of Wisconsin–Madison

Yunho OhYonsei University · School of Electrical and Electronic Engineering

About

Publications

Network

Yunho Oh
Yonsei University · School of Electrical and Electronic Engineering