An overall architecture of PIM.

Source publication
Article
Full-text available
The computing domain of today’s computer systems is shifting rapidly from arithmetic to data processing as data volumes grow exponentially. As a result, PIM (Processing-in-Memory) studies have been actively conducted to support data processing in or near memory devices, addressing the limited bandwidth and high power consumption due to data move...

Contexts in source publication

Context 1
... In this section, we present the detailed implementation of our PIM using matrix-vector multiplication, a core operation for many types of emerging applications such as neural network execution, AR/VR, and so on. Figure 5 shows the overall architecture of our PIM, which is primarily divided into three components: 1) a software stack consisting of the PIM application, the PIM library, and the PIM device driver, 2) a memory controller, and 3) the PIM device. ...
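The offload flow through this stack can be pictured with a small host-side sketch. The following is a minimal, hypothetical illustration of the application-to-library call pattern described above; the pim_gemv name and its plain-C body are assumptions for this example (in the real stack the library would build PIM commands and hand them to the device driver and memory controller), not the paper's API.

    /* Minimal host-side sketch of the offload flow
     * (application -> PIM library -> device driver).
     * The pim_* name is a hypothetical placeholder; here the "library"
     * simply computes y = A * x on the host so the sketch runs stand-alone. */
    #include <stdio.h>

    #define ROWS 4
    #define COLS 3

    /* Hypothetical library entry point: a real PIM library would translate
     * this call into PIM commands issued through the device driver. */
    static void pim_gemv(const float A[ROWS][COLS], const float x[COLS], float y[ROWS]) {
        for (int r = 0; r < ROWS; r++) {
            float acc = 0.0f;
            for (int c = 0; c < COLS; c++)
                acc += A[r][c] * x[c];   /* MACs done by the PIM engines in the real design */
            y[r] = acc;
        }
    }

    int main(void) {
        const float A[ROWS][COLS] = {{1,2,3},{4,5,6},{7,8,9},{1,0,1}};
        const float x[COLS] = {1, 2, 3};
        float y[ROWS];

        pim_gemv(A, x, y);               /* application-level call into the PIM library */

        for (int r = 0; r < ROWS; r++)
            printf("y[%d] = %g\n", r, y[r]);
        return 0;
    }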
Context 2
... The row-miss overhead was hidden by reducing the idle cycles and increasing the overlap between memory and computing operations. For a more detailed performance study, we show the cycle breakdown of the execution at delays of (8,4), (16,8), and (32,16) in Figure 15. This tolerance to row misses across all of our schedulings is one of the significant advantages of our design. ...
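As a rough illustration of why overlap hides row misses (the cycle counts here are assumptions chosen only for the arithmetic, not the paper's measurements): if a tile that misses the open row costs $t_{act} + t_{rd} = 30 + 16 = 46$ cycles of memory time while the PIM engines spend $t_{comp} = 64$ cycles computing on the previous tile, then serialized execution takes $46 + 64 = 110$ cycles per tile, whereas overlapped execution takes only $\max(46, 64) = 64$ cycles, so the row-miss penalty disappears behind the computation phase.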

Citations

... According to the study, scheduling across all banks and inside each bank saw performance improvements of 406% and 35.2%, respectively. Other studies [23]-[24] offer creative PIM methods to accelerate neural networks. ...
Article
Researchers studying computer architecture are paying close attention to processing-in-memory (PIM) techniques and a lot of time and money has been spent exploring and developing them. It is hoped that increasing the amount of research done on PIM approaches will help fulfill PIM’s promise to eliminate or greatly minimize memory access bottleneck issues for memory-intensive applications. In order to uncover unresolved issues, empower the research community to make wise judgments, and modify future research trajectories, we also think it is critical to keep track of PIM research advancements. In this review, we examine recent research that investigated PIM methodologies, highlight the innovations, contrast contemporary PIM designs, and pinpoint the target application domains and appropriate memory technologies. We also talk about ideas that address problems with PIM designs that have not yet been solved, such as the translation and mapping of operands, workload analysis to find application segments that can be sped up with PIM, OS/runtime support, and coherency problems that need to be fixed before PIM can be used. We think that this work can be a helpful resource for researchers looking into PIM methods.
... To use these VIMA instructions, as with most PIM techniques, specific instructions need to be inserted by the programmer with libraries and intrinsic functions [2,17,18], by the compiler [1,20], or by runtime devices [3,6]. However, libraries and intrinsics rely on the programmer's knowledge, making development difficult and error-prone. ...
Preprint
Full-text available
The growing data demands of modern applications have aggravated persistent problems in computer architecture, such as the Memory Wall. Processing-In-Memory (PIM) devices emerged as an alternative to mitigate these problems, bringing processing and storage units closer together to reduce data traffic between processors and memory. Literature and industry have proposed different PIM designs, and a prominent approach aims to expose memory bandwidth by placing large Single Instruction-Multiple Data (SIMD) or vector units in memory. One method to use these new processing units is runtime binary translators. Simple AVX to PIM Vectorizer (SAPIVe) provides transparency to users and independence from specific software by adopting a hardware-based binary translator to convert the application's vector instructions to PIM instructions. In this work, the authors expand SAPIVe's evaluations, explanations, and discussions, seeking to deepen the description of the mechanism and improve the understanding of its behavior and performance.
... Despite PIM's performance advantages, in-DRAM PIM has yet to become commercially available for the following two reasons. First, most designs have pursued an "accelerator-first" approach instead of a "memory-first" one, affecting the design of all architecture layers, such as cores [9], [11], [12] and memory controllers [14], [15], [17], [18], [24], and are thus incompatible with our current computing platforms. For example, the latest PIM studies from Samsung [23] and UPMEM [30] separated the PIM memory area from the non-PIM memory to avoid incompatibility with the JEDEC memory standard [31] when supporting all-bank execution. ...
Article
Full-text available
Processing-in-memory (PIM) has attracted attention to overcome the memory bandwidth limitation, especially for computing memory-intensive DNN applications. Most PIM approaches use the CPU's memory requests to deliver instructions and operands to the PIM engines, making a core busy and incurring unnecessary data transfer, thus resulting in significant offloading overhead. DMA can resolve the issue by transferring a high volume of successive data without CPU intervention or pollution of the memory hierarchy, thus fitting the PIM concept perfectly. However, the small computing resources of DRAM-based PIM devices allow us to transfer only small amounts of data in one DMA transaction and require a large number of descriptors, thus still incurring significant offloading overhead. This paper introduces a PIM Instruction Set Architecture (ISA) using a DMA descriptor, called PISA-DMA, to express a PIM opcode and operand in a single descriptor. Our ISA makes PIM programming intuitive by treating the commit of one PIM instruction as the completion of one DMA transaction and representing a sequence of PIM instructions as a DMA descriptor list. Also, PISA-DMA minimizes the offloading overhead while guaranteeing compatibility with commercial platforms. Our PISA-DMA eliminates the opcode offloading overhead and achieves 1.25x, 1.31x, and 1.29x speedup over the baseline PIM at a sequence length of 128 with the BERT, RoBERTa, and GPT-2 models, respectively, in ONNX runtime on real machines. Also, we study how our proposed PISA affects performance in compiler optimization and show that the operator fusion of matrix-matrix multiplication and element-wise addition achieved a 1.04x speedup, a performance gain similar to that obtained using conventional ISAs.
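A rough data-structure sketch makes the descriptor idea concrete. The layout below is hypothetical (field names, widths, and the submit stub are assumptions inferred from the abstract, not the paper's actual PISA-DMA encoding); it only shows how one opcode/operand pair per descriptor chains into a list that stands for a PIM instruction sequence.

    /* Hypothetical PISA-DMA-style descriptor: one descriptor = one DMA
     * transaction = one PIM instruction. Fields are illustrative. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    enum pim_opcode { PIM_MOV = 0, PIM_MAC = 1, PIM_ADD = 2 };  /* illustrative opcodes */

    struct pisa_dma_desc {
        uint32_t opcode;              /* PIM operation carried by this transaction  */
        uint64_t operand_addr;        /* address of the operand block to transfer   */
        uint32_t length;              /* bytes moved by this transaction            */
        struct pisa_dma_desc *next;   /* descriptor list = PIM instruction sequence */
    };

    /* Host-side stub: a real driver would program the DMA engine per descriptor;
     * committing a PIM instruction then corresponds to completing its transaction. */
    static int submit(const struct pisa_dma_desc *head) {
        int committed = 0;
        for (const struct pisa_dma_desc *d = head; d != NULL; d = d->next)
            committed++;
        return committed;
    }

    int main(void) {
        struct pisa_dma_desc add = { PIM_ADD, 0x2000, 4096, NULL };
        struct pisa_dma_desc mac = { PIM_MAC, 0x1000, 4096, &add };
        printf("committed %d PIM instructions\n", submit(&mac));
        return 0;
    }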
... Processing-in-Memory (PIM) has been proposed to alleviate the problem by placing the computing units in the memory [3]-[6], [12]-[21]. The in-bank PIM architecture, typically based on DRAMs, places the PIM engines in the bank peripherals, but the space for the engine implementation is very constrained [3]-[6], [12], [21]. Also, the engine's power consumption should be carefully considered, since high temperature results in more unreliable cells, higher refresh rates, and traffic throttling, leading to performance degradation [22], [23]. ...
... Level-3 BLAS (i.e., matrix-matrix (MM) multiplication) is more complicated than VM: either a multiplicand row is shared by multiplier columns or a multiplier column is shared by multiplicand rows, as depicted in Figure 1. However, most existing in-bank PIMs are designed for Level-2 BLAS, so they execute Level-3 BLAS by repeating the VM multiplication as many times as the multiplicand has rows [3]-[6], [12], [21]. ...
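The execution pattern this excerpt describes, Level-3 BLAS realized as repeated vector-matrix (VM) products, can be written down in a few lines. The plain-C sketch below is only an illustration of that loop structure; the sizes and the scalar inner loop are assumptions, not the PIM hardware's data path.

    /* Level-3 GEMM C = A * B realized as one VM product per row of the
     * multiplicand A, which is how Level-2-oriented in-bank PIMs run it. */
    #include <stdio.h>

    #define M 2
    #define K 3
    #define N 4

    /* One VM product: c_row = a_row * B (1xK times KxN). */
    static void vm(const float a_row[K], const float B[K][N], float c_row[N]) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += a_row[k] * B[k][j];
            c_row[j] = acc;
        }
    }

    int main(void) {
        const float A[M][K] = {{1,2,3},{4,5,6}};
        const float B[K][N] = {{1,0,0,1},{0,1,0,1},{0,0,1,1}};
        float C[M][N];

        /* Level-3 BLAS by repeating the VM product M times, once per row of A. */
        for (int i = 0; i < M; i++)
            vm(A[i], B, C[i]);

        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++)
                printf("%6.1f", C[i][j]);
            printf("\n");
        }
        return 0;
    }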
Article
Full-text available
Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low-locality, data-intensive applications. In-DRAM PIMs can be categorized by how many banks perform the PIM computation per DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks, achieving high performance but introducing design issues such as thermal and power consumption. We introduce memory-computation decoupling execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing per-bank execution, so it is easily adapted to commercial platforms. We divide the PIM execution into two phases: a memory phase and a computation phase. In the memory phase, we read the bank-private operands from a bank and store them in the PIM engines' registers bank-by-bank. In the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command to make all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the all-bank execution opportunity, we introduce a compiler analysis and code generation technique to identify the bank-private and bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM with commercial computing platforms. In Level-3 BLAS, we achieved speedups of 75.8x, 1.2x, and 4.7x compared to the CPU, the GPU, and the per-bank PIM, and up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than the GPU and the per-bank PIM, but 7.4% more than the ideal all-bank PIM.
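A small host-side simulation makes the two phases easy to follow. The sketch below is an assumption-laden illustration of the decoupled execution described in the abstract (the bank count, register sizes, and the MAC loop are made up for the example): the memory phase fills per-bank engine registers bank-by-bank, and the computation phase broadcasts one shared operand so every bank accumulates at once.

    /* Host-side simulation sketch of the memory/computation decoupling. */
    #include <stdio.h>

    #define NUM_BANKS 4
    #define VEC_LEN   8

    static float bank_private[NUM_BANKS][VEC_LEN]; /* e.g. per-bank weight rows   */
    static float engine_reg[NUM_BANKS][VEC_LEN];   /* PIM engine register files   */
    static float acc[NUM_BANKS];                   /* per-bank accumulators       */

    int main(void) {
        float shared[VEC_LEN];                     /* bank-shared operand (input) */

        for (int i = 0; i < VEC_LEN; i++) shared[i] = 1.0f;
        for (int b = 0; b < NUM_BANKS; b++)
            for (int i = 0; i < VEC_LEN; i++)
                bank_private[b][i] = (float)(b + 1);

        /* Memory phase: serviced bank-by-bank with standard per-bank commands. */
        for (int b = 0; b < NUM_BANKS; b++)
            for (int i = 0; i < VEC_LEN; i++)
                engine_reg[b][i] = bank_private[b][i];

        /* Computation phase: one broadcast read/write lets all banks compute
         * simultaneously on the shared operand. */
        for (int i = 0; i < VEC_LEN; i++)
            for (int b = 0; b < NUM_BANKS; b++)    /* conceptually in parallel */
                acc[b] += engine_reg[b][i] * shared[i];

        for (int b = 0; b < NUM_BANKS; b++)
            printf("bank %d: %g\n", b, acc[b]);
        return 0;
    }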
... We designed the PE using Verilog and verified it on an FPGA. Our experimental platform is the same as in [11]: the PyTorch-based target application [8] is executed on the host processor, and the MAC and activation function operations are performed on the HTG-Z920 FPGA [12] connected to the PCIe bus. ...
... PIM applications repeatedly perform the same operation, varying only the data inputs and outputs. Therefore, sending a command for every computation is very inefficient [20]. In addition, the memory model has also been carefully studied for efficiently developing PIM programming and execution models. ...
... Our PIM engine supports our argument since DRAM column commands can be issued sequentially at 4-cycle (tCCD) intervals. To validate that the proposed design can be implemented in a DRAM die, we used a 65nm logic process, which is similar to DRAM fabrication characteristics [20]. We verified that our design met the DRAM internal operating frequency of about 800MHz and fit within the available space near each bank of about 40,000µm². ...
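Reading those figures together gives a quick sanity check (an illustrative calculation derived from the quoted numbers, not a result stated in the paper): issuing one column command every $t_{CCD} = 4$ internal cycles at roughly 800MHz means the engine must sustain about $800\,\mathrm{MHz} / 4 = 200$ million column commands per second within the roughly 40,000µm² of peripheral area available next to each bank.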
... In the case of the data-intensive applications targeted by PIM, the same sequence of operations is repeatedly performed on a large amount of data. Therefore, repetitively sending the PIM operation commands from the host to the PIM device incurs significant unnecessary overhead [20]. ...
Article
Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly becoming attractive to the market, process a large amount of low-locality data; thus, the memory bandwidth limits their peak performance. Therefore, many data centers actively adopt high-bandwidth memory like HBM2/HBM2E to resolve the problem. However, this approach does not provide a complete solution since it still transfers the data from the memory to the computing unit. Thus, processing-in-memory (PIM), which performs the computation inside memory, has attracted attention. However, most previous methods require the modification or extension of core pipelines and memory system components like memory controllers, making the practical implementation of PIM very challenging and expensive to develop. In this article, we propose Silent-PIM, which performs the PIM computation with standard DRAM memory requests, thus requiring no hardware modifications and allowing the PIM memory device to perform the computation while servicing non-PIM applications' memory requests. We achieve our design goal by preserving the standard memory request behaviors and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM's offloading engine, processing the PIM memory requests quickly and freeing a core to perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For $(p \times 512) \times (512 \times 2048)$ matrix multiplication with a batch size $p$ varying from 1 to 128, the Silent-PIM performed up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at $p=1$, which is the case without any data reuse. At $p=128$, the highest data reuse case, the GPU performance was the highest, but the PIM performance was still higher than the CPU execution. Similarly, for $(p \times 2048)$ element-wise multiplication and addition, where there is no data reuse, the Silent-PIM always achieved higher performance than both the CPU and GPU. The PIM's EDP was also superior to the others in all the cases with no data reuse.
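The reuse argument behind these numbers can be made explicit with a short calculation (an illustration derived from the sizes quoted above): in the $(p \times 512) \times (512 \times 2048)$ multiplication, every element of the $512 \times 2048$ weight matrix is fetched once but multiplied $p$ times, so the reuse factor equals the batch size: 1 at $p=1$, where execution is purely bandwidth-bound and Silent-PIM's in-memory bandwidth yields the 16.9x/24.6x advantage, and 128 at $p=128$, where the GPU can amortize each weight fetch over 128 multiply-accumulates from its caches and regains the lead.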
Conference Paper
The von Neumann bottleneck arises from the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Generating insight where data is stored helps with energy efficiency, low latency, storage bus bandwidth, and security. In this paper, we explore near-data processing focusing on Processing-In-Memory (PIM) to address data integrity through the inclusion of CRC compute units within the DRAM controller, as well as data security through an Advanced Encryption Standard (AES) acceleration unit. Our experimental results are based on an HBM3 platform. We achieved a computing memory with minimal command and area overhead and the same power consumption compared to a non-computing one.
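For readers unfamiliar with the integrity check being offloaded, the reference routine below shows the kind of computation such a CRC unit performs. It is the standard reflected CRC-32 (polynomial 0xEDB88320) in plain C, given only as background; it is not the specific CRC variant or hardware implementation used in this paper.

    /* Reference (software) CRC-32, reflected form, polynomial 0xEDB88320. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32(const uint8_t *data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
        }
        return ~crc;
    }

    int main(void) {
        const char *msg = "123456789";
        /* CRC-32 of "123456789" is the well-known check value 0xCBF43926. */
        printf("0x%08X\n", crc32((const uint8_t *)msg, strlen(msg)));
        return 0;
    }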
Article
The von Neumann bottleneck arises from the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. Given the increasing storage capacities, moving extensive data volumes between storage and computation cannot scale up. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Gathering insights where data is stored helps with energy efficiency, low latency, and security. Storage bus bandwidth is also saved when only computation results are delivered to the host memory. Various applications, including database acceleration, machine learning, Artificial Intelligence (AI), offloading (compression/encryption/encoding), and others can perform better and become more scalable if the "move process to data" paradigm is applied. Embedding processing engines inside Solid-State Drives (SSDs), transforming them into Computational Storage Devices (CSDs), provides the needed data processing solution. In this paper, we review the prior art on Near Data Processing (NDP) with a focus on In-Storage Computing (ISC), identifying main challenges and potential gaps for future research directions.