An overall architecture of PIM.

Source publication
Article
Full-text available
The computing domain of today’s computer systems is shifting rapidly from arithmetic to data processing as data volumes grow exponentially. As a result, PIM (Processing-in-Memory) studies have been actively conducted to support data processing in or near memory devices, addressing the limited bandwidth and high power consumption due to data move...

Contexts in source publication

Context 1
... In this section, we present the detailed implementation of our PIM using matrix-vector multiplication, a core operation for many types of emerging applications such as neural network execution, AR/VR, and so on. Figure 5 shows the overall architecture of our PIM, which is primarily divided into three components: 1) a software stack consisting of the PIM application, the PIM library, and the PIM device driver, 2) a memory controller, and 3) the PIM device. ...
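The offload flow through this stack can be pictured with a small host-side sketch. The following is a minimal, hypothetical illustration of the application-to-library call pattern described above; the pim_gemv name and its plain-C body are assumptions for this example (in the real stack the library would build PIM commands and hand them to the device driver and memory controller), not the paper's API.

    /* Minimal host-side sketch of the offload flow
     * (application -> PIM library -> device driver).
     * The pim_* name is a hypothetical placeholder; here the "library"
     * simply computes y = A * x on the host so the sketch runs stand-alone. */
    #include <stdio.h>

    #define ROWS 4
    #define COLS 3

    /* Hypothetical library entry point: a real PIM library would translate
     * this call into PIM commands issued through the device driver. */
    static void pim_gemv(const float A[ROWS][COLS], const float x[COLS], float y[ROWS]) {
        for (int r = 0; r < ROWS; r++) {
            float acc = 0.0f;
            for (int c = 0; c < COLS; c++)
                acc += A[r][c] * x[c];   /* MACs done by the PIM engines in the real design */
            y[r] = acc;
        }
    }

    int main(void) {
        const float A[ROWS][COLS] = {{1,2,3},{4,5,6},{7,8,9},{1,0,1}};
        const float x[COLS] = {1, 2, 3};
        float y[ROWS];

        pim_gemv(A, x, y);               /* application-level call into the PIM library */

        for (int r = 0; r < ROWS; r++)
            printf("y[%d] = %g\n", r, y[r]);
        return 0;
    }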
Context 2
... The row-miss overhead was hidden by reducing the idle cycles and increasing the overlap between memory and computing operations. For a more detailed performance study, we show the cycle breakdown of the execution at delays of (8,4), (16,8), and (32,16) in Figure 15. This tolerance to row misses across all of our schedulings is one of the significant advantages of our design. ...
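As a rough illustration of why overlap hides row misses (the cycle counts here are assumptions chosen only for the arithmetic, not the paper's measurements): if a tile that misses the open row costs $t_{act} + t_{rd} = 30 + 16 = 46$ cycles of memory time while the PIM engines spend $t_{comp} = 64$ cycles computing on the previous tile, then serialized execution takes $46 + 64 = 110$ cycles per tile, whereas overlapped execution takes only $\max(46, 64) = 64$ cycles, so the row-miss penalty disappears behind the computation phase.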

Citations

... According to the study, scheduling across all banks and inside each bank saw performance improvements of 406% and 35.2%, respectively. Other studies [23]-[24] offer creative PIM methods to accelerate neural networks. ...
Article
Researchers studying computer architecture are paying close attention to processing-in-memory (PIM) techniques and a lot of time and money has been spent exploring and developing them. It is hoped that increasing the amount of research done on PIM approaches will help fulfill PIM’s promise to eliminate or greatly minimize memory access bottleneck issues for memory-intensive applications. In order to uncover unresolved issues, empower the research community to make wise judgments, and modify future research trajectories, we also think it is critical to keep track of PIM research advancements. In this review, we examine recent research that investigated PIM methodologies, highlight the innovations, contrast contemporary PIM designs, and pinpoint the target application domains and appropriate memory technologies. We also talk about ideas that address problems with PIM designs that have not yet been solved, such as the translation and mapping of operands, workload analysis to find application segments that can be sped up with PIM, OS/runtime support, and coherency problems that need to be fixed before PIM can be used. We think that this work can be a helpful resource for researchers looking into PIM methods.
... To use these VIMA instructions, as with most PIM techniques, specific instructions need to be inserted by the programmer with libraries and intrinsic functions [2,17,18], by the compiler [1,20], or by runtime devices [3,6]. However, libraries and intrinsics rely on the programmer's knowledge, making development difficult and error-prone. ...
Preprint
Full-text available
The growing data demands of modern applications have aggravated persistent problems in computer architecture, such as the Memory Wall. Processing-In-Memory (PIM) devices emerged as an alternative to mitigate these problems, bringing processing and storage units closer together to reduce data traffic between processors and memory. Literature and industry have proposed different PIM designs, and a prominent approach aims to expose memory bandwidth by placing large Single Instruction-Multiple Data (SIMD) or vector units in memory. One method to use these new processing units is runtime binary translators. Simple AVX to PIM Vectorizer (SAPIVe) provides transparency to users and independence from specific software by adopting a hardware-based binary translator to convert the application's vector instructions to PIM instructions. In this work, the authors expand SAPIVe's evaluations, explanations, and discussions, seeking to deepen the description of the mechanism and improve the understanding of its behavior and performance.
... Despite PIM's performance advantages, in-DRAM PIM has yet to become commercially available for the following two reasons. First, most designs have pursued an "accelerator-first" approach instead of a "memory-first" one, affecting the design of all architecture layers, such as cores [9], [11], [12] and memory controllers [14], [15], [17], [18], [24], and are thus incompatible with our current computing platforms. For example, the latest PIM studies from Samsung [23] and UPMEM [30] separated the PIM memory area from the non-PIM memory to avoid incompatibility with the JEDEC memory standard [31] when supporting all-bank execution. ...
Article
Full-text available
Processing-in-memory (PIM) has attracted attention to overcome the memory bandwidth limitation, especially for computing memory-intensive DNN applications. Most PIM approaches use the CPU's memory requests to deliver instructions and operands to the PIM engines, making a core busy and incurring unnecessary data transfer, thus resulting in significant offloading overhead. DMA can resolve the issue by transferring a high volume of successive data without CPU intervention or pollution of the memory hierarchy, thus fitting the PIM concept perfectly. However, the small computing resources of DRAM-based PIM devices allow us to transfer only small amounts of data in one DMA transaction and require a large number of descriptors, thus still incurring significant offloading overhead. This paper introduces a PIM Instruction Set Architecture (ISA) using a DMA descriptor, called PISA-DMA, to express a PIM opcode and operand in a single descriptor. Our ISA makes PIM programming intuitive by treating the commit of one PIM instruction as the completion of one DMA transaction and representing a sequence of PIM instructions as a DMA descriptor list. Also, PISA-DMA minimizes the offloading overhead while guaranteeing compatibility with commercial platforms. Our PISA-DMA eliminates the opcode offloading overhead and achieves 1.25x, 1.31x, and 1.29x speedup over the baseline PIM at a sequence length of 128 with the BERT, RoBERTa, and GPT-2 models, respectively, in ONNX runtime on real machines. Also, we study how our proposed PISA affects performance in compiler optimization and show that the operator fusion of matrix-matrix multiplication and element-wise addition achieved a 1.04x speedup, a performance gain similar to that obtained using conventional ISAs.
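A rough data-structure sketch makes the descriptor idea concrete. The layout below is hypothetical (field names, widths, and the submit stub are assumptions inferred from the abstract, not the paper's actual PISA-DMA encoding); it only shows how one opcode/operand pair per descriptor chains into a list that stands for a PIM instruction sequence.

    /* Hypothetical PISA-DMA-style descriptor: one descriptor = one DMA
     * transaction = one PIM instruction. Fields are illustrative. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    enum pim_opcode { PIM_MOV = 0, PIM_MAC = 1, PIM_ADD = 2 };  /* illustrative opcodes */

    struct pisa_dma_desc {
        uint32_t opcode;              /* PIM operation carried by this transaction  */
        uint64_t operand_addr;        /* address of the operand block to transfer   */
        uint32_t length;              /* bytes moved by this transaction            */
        struct pisa_dma_desc *next;   /* descriptor list = PIM instruction sequence */
    };

    /* Host-side stub: a real driver would program the DMA engine per descriptor;
     * committing a PIM instruction then corresponds to completing its transaction. */
    static int submit(const struct pisa_dma_desc *head) {
        int committed = 0;
        for (const struct pisa_dma_desc *d = head; d != NULL; d = d->next)
            committed++;
        return committed;
    }

    int main(void) {
        struct pisa_dma_desc add = { PIM_ADD, 0x2000, 4096, NULL };
        struct pisa_dma_desc mac = { PIM_MAC, 0x1000, 4096, &add };
        printf("committed %d PIM instructions\n", submit(&mac));
        return 0;
    }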
... Processing-in-Memory (PIM) has been proposed to alleviate the problem by placing the computing units in the memory [3]-[6], [12]-[21]. The in-bank PIM architecture, typically based on DRAMs, places the PIM engines in the bank peripherals, but the space for the engine implementation is very constrained [3]-[6], [12], [21]. Also, the engine's power consumption should be carefully considered, since high temperature results in more unreliable cells, higher refresh rates, and traffic throttling, leading to performance degradation [22], [23]. ...
... Level-3 BLAS (i.e., matrix-matrix (MM) multiplication) is more complicated than VM: either a multiplicand row is shared by multiplier columns or a multiplier column is shared by multiplicand rows, as depicted in Figure 1. However, most existing in-bank PIMs are designed for Level-2 BLAS, so they execute Level-3 BLAS by repeating the VM multiplication as many times as the multiplicand has rows [3]-[6], [12], [21]. ...
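The execution pattern this excerpt describes, Level-3 BLAS realized as repeated vector-matrix (VM) products, can be written down in a few lines. The plain-C sketch below is only an illustration of that loop structure; the sizes and the scalar inner loop are assumptions, not the PIM hardware's data path.

    /* Level-3 GEMM C = A * B realized as one VM product per row of the
     * multiplicand A, which is how Level-2-oriented in-bank PIMs run it. */
    #include <stdio.h>

    #define M 2
    #define K 3
    #define N 4

    /* One VM product: c_row = a_row * B (1xK times KxN). */
    static void vm(const float a_row[K], const float B[K][N], float c_row[N]) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += a_row[k] * B[k][j];
            c_row[j] = acc;
        }
    }

    int main(void) {
        const float A[M][K] = {{1,2,3},{4,5,6}};
        const float B[K][N] = {{1,0,0,1},{0,1,0,1},{0,0,1,1}};
        float C[M][N];

        /* Level-3 BLAS by repeating the VM product M times, once per row of A. */
        for (int i = 0; i < M; i++)
            vm(A[i], B, C[i]);

        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++)
                printf("%6.1f", C[i][j]);
            printf("\n");
        }
        return 0;
    }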
Article
Full-text available
Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low-locality, data-intensive applications. In-DRAM PIMs can be categorized by how many banks perform the PIM computation per DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks, achieving high performance but introducing design issues such as thermal and power consumption. We introduce memory-computation decoupling execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing per-bank execution, so it is easily adapted to commercial platforms. We divide the PIM execution into two phases: a memory phase and a computation phase. In the memory phase, we read the bank-private operands from a bank and store them in the PIM engines' registers bank-by-bank. In the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command to make all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. To extend the computation phase, i.e., to maximize the all-bank execution opportunity, we introduce a compiler analysis and code generation technique to identify the bank-private and bank-shared operands. We compared the performance of Level-2/3 BLAS, a multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM with commercial computing platforms. In Level-3 BLAS, we achieved speedups of 75.8x, 1.2x, and 4.7x compared to the CPU, the GPU, and the per-bank PIM, and up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed 72.0% and 78.4% less energy than the GPU and the per-bank PIM, but 7.4% more than the ideal all-bank PIM.
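A small host-side simulation makes the two phases easy to follow. The sketch below is an assumption-laden illustration of the decoupled execution described in the abstract (the bank count, register sizes, and the MAC loop are made up for the example): the memory phase fills per-bank engine registers bank-by-bank, and the computation phase broadcasts one shared operand so every bank accumulates at once.

    /* Host-side simulation sketch of the memory/computation decoupling. */
    #include <stdio.h>

    #define NUM_BANKS 4
    #define VEC_LEN   8

    static float bank_private[NUM_BANKS][VEC_LEN]; /* e.g. per-bank weight rows   */
    static float engine_reg[NUM_BANKS][VEC_LEN];   /* PIM engine register files   */
    static float acc[NUM_BANKS];                   /* per-bank accumulators       */

    int main(void) {
        float shared[VEC_LEN];                     /* bank-shared operand (input) */

        for (int i = 0; i < VEC_LEN; i++) shared[i] = 1.0f;
        for (int b = 0; b < NUM_BANKS; b++)
            for (int i = 0; i < VEC_LEN; i++)
                bank_private[b][i] = (float)(b + 1);

        /* Memory phase: serviced bank-by-bank with standard per-bank commands. */
        for (int b = 0; b < NUM_BANKS; b++)
            for (int i = 0; i < VEC_LEN; i++)
                engine_reg[b][i] = bank_private[b][i];

        /* Computation phase: one broadcast read/write lets all banks compute
         * simultaneously on the shared operand. */
        for (int i = 0; i < VEC_LEN; i++)
            for (int b = 0; b < NUM_BANKS; b++)    /* conceptually in parallel */
                acc[b] += engine_reg[b][i] * shared[i];

        for (int b = 0; b < NUM_BANKS; b++)
            printf("bank %d: %g\n", b, acc[b]);
        return 0;
    }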
... We designed the PE using Verilog and verified it on an FPGA. Our experimental platform is the same as in [11]: the PyTorch-based target application [8] is executed on the host processor, and the MAC and activation function operations are performed on the HTG-Z920 FPGA [12] connected to the PCIe bus. ...
... PIM applications repeatedly perform the same operation, varying only the data inputs and outputs. Therefore, sending a command for every computation is very inefficient [20]. In addition, the memory model has also been carefully studied for efficiently developing PIM programming and execution models. ...
... Our PIM engine supports our argument since DRAM column commands can be issued sequentially at 4-cycle (tCCD) intervals. To validate that the proposed design can be implemented in a DRAM die, we used a 65nm logic process, which is similar to DRAM fabrication characteristics [20]. We verified that our design met the DRAM internal operating frequency of about 800MHz and fit within the available space near each bank of about 40,000µm². ...
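Reading those figures together gives a quick sanity check (an illustrative calculation derived from the quoted numbers, not a result stated in the paper): issuing one column command every $t_{CCD} = 4$ internal cycles at roughly 800MHz means the engine must sustain about $800\,\mathrm{MHz} / 4 = 200$ million column commands per second within the roughly 40,000µm² of peripheral area available next to each bank.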
... In the case of the data-intensive applications targeted by PIM, the same sequence of operations is repeatedly performed on a large amount of data. Therefore, repetitively sending the PIM operation commands from the host to the PIM device incurs significant unnecessary overhead [20]. ...
Article
Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly becoming attractive to the market, process a large amount of low-locality data; thus, the memory bandwidth limits their peak performance. Therefore, many data centers actively adopt high-bandwidth memory like HBM2/HBM2E to resolve the problem. However, this approach does not provide a complete solution since it still transfers the data from the memory to the computing unit. Thus, processing-in-memory (PIM), which performs the computation inside memory, has attracted attention. However, most previous methods require the modification or extension of core pipelines and memory system components like memory controllers, making the practical implementation of PIM very challenging and expensive to develop. In this article, we propose Silent-PIM, which performs the PIM computation with standard DRAM memory requests, thus requiring no hardware modifications and allowing the PIM memory device to perform the computation while servicing non-PIM applications' memory requests. We achieve our design goal by preserving the standard memory request behaviors and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM's offloading engine, processing the PIM memory requests quickly and freeing a core to perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For $(p \times 512) \times (512 \times 2048)$ matrix multiplication with a batch size $p$ varying from 1 to 128, the Silent-PIM performed up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at $p=1$, which is the case without any data reuse. At $p=128$, the highest data reuse case, the GPU performance was the highest, but the PIM performance was still higher than the CPU execution. Similarly, for $(p \times 2048)$ element-wise multiplication and addition, where there is no data reuse, the Silent-PIM always achieved higher performance than both the CPU and GPU. The PIM's EDP was also superior to the others in all the cases with no data reuse.
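The reuse argument behind these numbers can be made explicit with a short calculation (an illustration derived from the sizes quoted above): in the $(p \times 512) \times (512 \times 2048)$ multiplication, every element of the $512 \times 2048$ weight matrix is fetched once but multiplied $p$ times, so the reuse factor equals the batch size: 1 at $p=1$, where execution is purely bandwidth-bound and Silent-PIM's in-memory bandwidth yields the 16.9x/24.6x advantage, and 128 at $p=128$, where the GPU can amortize each weight fetch over 128 multiply-accumulates from its caches and regains the lead.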
Conference Paper
The von Neumann bottleneck arises from the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Generating insight where data is stored helps with energy efficiency, low latency, storage bus bandwidth, and security. In this paper, we explore near-data processing focusing on Processing-In-Memory (PIM) to address data integrity through the inclusion of CRC compute units within the DRAM controller, as well as data security through an Advanced Encryption Standard (AES) acceleration unit. Our experimental results are based on an HBM3 platform. We achieved a computing memory with minimal command and area overhead and the same power consumption compared to a non-computing one.
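For readers unfamiliar with the integrity check being offloaded, the reference routine below shows the kind of computation such a CRC unit performs. It is the standard reflected CRC-32 (polynomial 0xEDB88320) in plain C, given only as background; it is not the specific CRC variant or hardware implementation used in this paper.

    /* Reference (software) CRC-32, reflected form, polynomial 0xEDB88320. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32(const uint8_t *data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
        }
        return ~crc;
    }

    int main(void) {
        const char *msg = "123456789";
        /* CRC-32 of "123456789" is the well-known check value 0xCBF43926. */
        printf("0x%08X\n", crc32((const uint8_t *)msg, strlen(msg)));
        return 0;
    }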
Article
The von Neumann bottleneck arises from the explosion of data transfers and emerging data-intensive applications in heterogeneous system architectures. The conventional approach of transferring data to the CPU for computation is no longer suitable, especially given the cost it imposes. Given the increasing storage capacities, moving extensive data volumes between storage and computation cannot scale up. Hence, high-performance data processing mechanisms are needed, which may be achieved by bringing computation closer to data. Gathering insights where data is stored helps with energy efficiency, low latency, and security. Storage bus bandwidth is also saved when only computation results are delivered to the host memory. Various applications, including database acceleration, machine learning, Artificial Intelligence (AI), offloading (compression/encryption/encoding), and others can perform better and become more scalable if the "move process to data" paradigm is applied. Embedding processing engines inside Solid-State Drives (SSDs), transforming them into Computational Storage Devices (CSDs), provides the needed data processing solution. In this paper, we review the prior art on Near Data Processing (NDP) with a focus on In-Storage Computing (ISC), identifying main challenges and potential gaps for future research directions.