Figure 5: Pipelined Serial Mode of RowClone

Source publication
Conference Paper
Full-text available
Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy--degrading both system performance and energy efficiency.

Context in source publication

Context 1
... Unlike READ/WRITE, which interact with the memory channel connecting the processor and main memory, TRANSFER does not transfer data outside the chip. Figure 5 pictorially compares the operation of the TRANSFER command with that of READ and WRITE. The dashed lines indicate the data flow corresponding to the three commands. ...
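To make the contrast in Figure 5 concrete, here is a minimal Python sketch (a conceptual model, not from the paper; the function names and 8-byte channel width are illustrative) that counts off-chip channel beats for a processor-mediated copy versus an in-chip TRANSFER:

```python
# Conceptual model of the data flow in Figure 5 (not cycle-accurate).
# READ/WRITE move every byte over the off-chip memory channel twice
# (memory -> processor, then processor -> memory), while TRANSFER keeps
# the data on the internal bus, so only the command crosses the channel.

def copy_with_read_write(num_bytes: int, bus_width: int = 8) -> int:
    """Off-chip channel beats for a processor-mediated copy."""
    beats_per_direction = num_bytes // bus_width
    return 2 * beats_per_direction  # read the data out, then write it back

def copy_with_transfer(num_bytes: int) -> int:
    """Off-chip channel beats when the copy uses the internal bus."""
    return 0  # data never leaves the chip

if __name__ == "__main__":
    page = 4096  # a 4KB copy, as in the paper's evaluation
    print("READ/WRITE beats:", copy_with_read_write(page))  # 1024
    print("TRANSFER beats:  ", copy_with_transfer(page))    # 0
```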

Similar publications

Article
Full-text available
The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM's speed and throughput. Moreover, on multi-core platforms, DRAM memory shared by all cores usually suffers from the memory contention and interference problem, which can cause...
Conference Paper
Full-text available
Application-specific accelerators may provide considerable speedup in single-core systems with a runtime-reconfigurable fabric (for simplicity called "fabric" in the following). A reconfigurable core, i.e. processor core pipeline coupled to a fabric, can be integrated along with regular general purpose processor cores (GPPs) into a reconfigurable m...

Citations

... A number of studies have investigated rapid DRAM initialization techniques. Seshadri et al. proposed RowClone, a technique for initializing contiguous DRAM sections with zeros [21]. Seol et al. proposed RowReset, a hardware-efficient memory initialization solution that manipulates the VDD and VSS supplies to DRAM banks [22]. ...
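As a rough illustration of RowClone-style zeroing, the following Python sketch models a subarray with a reserved all-zeros row that is row-copied into each destination row; the reserved-row idea follows the RowClone paper, but the model itself is a simplification:

```python
# Simplified functional model (not the paper's implementation): zeroing a
# subarray by row-copying a reserved all-zeros row into every other row.

ROWS, COLS = 8, 16
subarray = [[1] * COLS for _ in range(ROWS)]  # start with stale data
ZERO_ROW = 0                                  # reserved row that always holds zeros
subarray[ZERO_ROW] = [0] * COLS

def row_copy(src: int, dst: int) -> None:
    """Model of an in-DRAM copy: sense amps latch src, then drive dst."""
    subarray[dst] = subarray[src][:]

for row in range(1, ROWS):                    # bulk-zero the rest of the subarray
    row_copy(ZERO_ROW, row)

assert all(bit == 0 for row in subarray for bit in row)
```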
Preprint
Full-text available
FPGA-based hardware accelerators are becoming increasingly popular due to their versatility, customizability, energy efficiency, constant latency, and scalability. FPGAs can be tailored to specific algorithms, enabling efficient hardware implementations that effectively leverage algorithm parallelism. This can lead to significant performance improvements over CPUs and GPUs, particularly for highly parallel applications. For example, a recent study found that Stratix 10 FPGAs can achieve up to 90% of the performance of a TitanX Pascal GPU while consuming less than 50% of the power. This makes FPGAs an attractive choice for accelerating machine learning (ML) workloads. However, our research finds privacy and security vulnerabilities in existing Xilinx FPGA-based hardware acceleration solutions. These vulnerabilities arise from the lack of memory initialization and insufficient process isolation, which creates potential avenues for unauthorized access to private data used by processes. To illustrate this issue, we conducted experiments using a Xilinx ZCU104 board running the PetaLinux tool from Xilinx. We found that PetaLinux does not effectively clear memory locations associated with a terminated process, leaving them vulnerable to a memory scraping attack (MSA). This paper makes two main contributions. The first contribution is an attack methodology that uses the Xilinx debugger from a different user space. We find that we are able to access process IDs, virtual address spaces, and pagemaps of one user from a different user space because of a lack of adequate process isolation. The second contribution is a methodology for characterizing terminated processes and accessing their private data. We illustrate this on the Xilinx ML application library.
... Keeping this ever-increasing data in a secured backup is a challenging task in terms of performance, energy, and memory. While intelligent and efficient algorithms were proposed for bulk data movement in data centers using row-level cloning [16], integrity verification of the copy procedure is also extremely important. Moreover, in the age of cybersecurity and identity theft, data encryption is equally crucial. ...
Article
Full-text available
Big data applications are on the rise, and so is the number of data centers. The ever-increasing massive data pool needs to be periodically backed up in a secure environment. Moreover, a massive amount of securely backed-up data is required for training binary convolutional neural networks for image classification. XOR and XNOR operations are essential for large-scale data copy verification, encryption, and classification algorithms. The disproportionate speed of existing compute and memory units makes the von Neumann architecture inefficient for these Boolean operations. Compute-in-memory (CiM) has proved to be an optimal approach for such bulk computations. Existing CiM-based XOR/XNOR techniques either require multiple compute cycles or add complexity to the fabrication process. Here, we propose a CMOS-based hardware topology for single-cycle in-memory XOR/XNOR operations. Our design provides at least a 2× latency improvement over other existing CMOS-compatible solutions. We verify the proposed system through circuit/system-level simulations and evaluate its robustness using a 5000-point Monte Carlo variation analysis. This all-CMOS design paves the way for practical implementation of CiM XOR/XNOR at scaled technology nodes.
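The cited contribution is a CMOS circuit topology that a software model cannot reproduce, but a behavioral Python sketch can show the use case the abstract highlights: verifying a bulk copy with in-memory XOR, where an all-zero result means the copy is intact.

```python
# Behavioral sketch only; rows are modeled as Python ints (bit-vectors).
# This illustrates the copy-verification use case named in the abstract,
# not the proposed single-cycle circuit itself.

def in_memory_xor(row_a: int, row_b: int) -> int:
    """Bitwise XOR of two rows (single-cycle in the proposed hardware)."""
    return row_a ^ row_b

def copy_is_intact(src_row: int, dst_row: int) -> bool:
    """A copy is verified iff XOR of source and destination is all zeros."""
    return in_memory_xor(src_row, dst_row) == 0

source = 0b1011_0110_0101_0011
good_copy = source
bad_copy = source ^ (1 << 7)  # inject a single-bit error

assert copy_is_intact(source, good_copy)
assert not copy_is_intact(source, bad_copy)
```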
... To solve this issue, three rows in the array can be specially designed for logic operations. Additionally, before and after the logic operations, the RowClone operation [63], which copies data in a source row to a destination row by using the same charge-sharing principle, should be performed to transfer the inputs and outputs within the array. ...
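The "three rows designed for logic" likely refers to Ambit-style triple-row activation. A functional Python sketch of that idea (the model is a simplification, not the cited design): activating three rows at once charge-shares each column, and the sense amplifier resolves to the majority of the three cells, from which AND and OR follow by fixing one input.

```python
# Functional sketch of triple-row activation: the sense amplifier resolves
# each column to MAJ(a, b, c), so MAJ(a, b, 0) = AND(a, b) and
# MAJ(a, b, 1) = OR(a, b).

def triple_row_activate(row_a, row_b, row_c):
    """Per-column majority, as resolved by the sense amplifiers."""
    return [int(a + b + c >= 2) for a, b, c in zip(row_a, row_b, row_c)]

a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
zeros, ones = [0] * 4, [1] * 4

assert triple_row_activate(a, b, zeros) == [1, 0, 0, 0]  # AND(a, b)
assert triple_row_activate(a, b, ones)  == [1, 1, 1, 0]  # OR(a, b)
```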
... Accounting for both memory structure and behavior facilitates deployment of the desired PiM logic at any desired location. Finally, to demonstrate the capabilities of the PiMulator platform, we show strategies to prototype and emulate both pioneering bitwise-PiM architectures such as RowClone [33], LISA [34], and Ambit [35] and generic PiM architectures such as Fulcrum [36] and BLIMP [37]. These prior works demonstrate high-bandwidth, low-latency, and energy-efficient PiM operations at the subarray, bank, and chip level. ...
... • We demonstrate strategies for prototyping and design-space exploration of bitwise-PiM and more general/complex PiM architectures. We conduct PiM architecture evaluations by exploring various subarray configurations for RowClone [33] and Ambit [35], including inter-subarray links as described in LISA [34]. Next, we showcase emulation and explore various configurations of PiM architectures that together cover a larger PiM space, such as Fulcrum [36], DRISA [39], Sieve [40], and BLIMP [37]. ...
... PiDRAM [14] demonstrates RowClone FPM [33] data copy, Ambit-like [35] AND-OR-NOT operations, and true random number generation using DDR3 memory by violating the memory timings. ...
Thesis
Full-text available
Motivated by the memory wall problem, researchers propose numerous Processing-in-Memory (PiM) architectures to bring computation closer to data. Evaluating the performance of these emerging architectures is challenging due to the lack of tools that accurately mimic both software and hardware aspects. This thesis presents PiMulator, an open-source platform for system-level PiM emulation, suitable for rapid prototyping and evaluation of PiM architectures. At its core, PiMulator incorporates MEMulator, a main-memory emulation model implemented in SystemVerilog, enabling users to generate any desired memory configuration on the FPGA fabric with complete control over the PiM logic units. Furthermore, we develop and implement the FreezeTime mechanism, effectively extending the emulated memory capacity by synchronizing the limited FPGA chip memory resources with the board's DDR4 and HBM2 resources. This approach offers flexibility in modeling memory and logic behavior without compromising emulated time accuracy. The platform integrates the Memory+PiM model into the LiteX framework, ensuring compatibility with a robust FPGA and RISC-V ecosystem. We adopt a system emulation approach that combines CPUs, memory controllers, memories, interconnect, and peripherals using soft cores synthesizable on FPGA boards. This enables architects to prototype, emulate, and evaluate various PiM architectures and designs at the system level. PiMulator facilitates high-speed and high-fidelity modeling and evaluation of emerging memory and PiM architectures with workloads of interest, utilizing soft cores synthesizable on FPGA boards. This thesis demonstrates strategies to model several pioneering PiM architectures and provides detailed benchmark performance results, showcasing the platform's ability to facilitate design space exploration.
... With the proliferation of today's domain-specific architectures, there exists a variety of PIM or near-memory processing studies [14], [16], [18], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [84], [85], [86], [87], [88], [89], [90], [91], [92], [93], [94], [95], [96], [97]. There are also several prior works on PIM exploring compiler issues [8], [10], [98], data coherency [11], [12], [13], [99], synchronization [100], and QoS-aware runtime and scheduling for PIM [101], among many others [102], [103], [104], [105], [106], [107], [108]. This paper focuses on characterizing the first real-world general-purpose PIM via our PIMulator, pathfinding important research directions for future PIMs. ...
Preprint
Full-text available
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deep-dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes into their compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem will be critical for future PIM architectures to support.
... To mitigate the data-movement bottleneck, In-Memory Computing (IMC) architectures were proposed to avoid moving data across chips. The main idea is to process data where it resides, in DRAMs (Seshadri et al., 2013; Farmahini-Farahani et al., 2015) or SRAMs (Jeloka et al., 2015; Eckert et al., 2018; Fujiki et al., 2019; Al-Hawaj et al., 2020). In SRAMs, the bitline computing approach is used to compute bitwise AND and NOR operations on two simultaneously activated wordlines. ...
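A behavioral Python model of SRAM bitline computing as just described (bit widths and names are illustrative): with two wordlines activated at once, the bitline only stays high where both cells store 1, yielding AND, while the complementary bitline yields NOR.

```python
# Behavioral model of SRAM bitline computing on two activated wordlines.
# BL senses AND(a, b); the complementary bitline BLB senses NOR(a, b).

def bitline_and(word_a: int, word_b: int) -> int:
    """Value sensed on BL when both wordlines are activated."""
    return word_a & word_b

def bitline_nor(word_a: int, word_b: int, width: int = 8) -> int:
    """Value sensed on BLB: high only where both cells store 0."""
    mask = (1 << width) - 1
    return ~(word_a | word_b) & mask

a, b = 0b1100_1010, 0b1010_0110
print(f"AND: {bitline_and(a, b):08b}")  # 10000010
print(f"NOR: {bitline_nor(a, b):08b}")  # 00010001
```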
Article
Full-text available
Applications of Artificial Intelligence currently dominate the technology landscape. Meanwhile, conventional von Neumann architectures are struggling with the data-movement bottleneck to meet the ever-increasing performance demands of these data-centric applications. Moreover, the vector-matrix multiplication cost, in the binary domain, is a major computational bottleneck for these applications. This paper introduces a novel digital in-memory stochastic computing architecture that leverages the simplicity of stochastic computing for in-memory vector-matrix multiplication. The proposed architecture incorporates several new approaches, including a new stochastic number generator with ideal binary-to-stochastic mapping, a best-seeding approach for accurate-enough low stochastic bit-precisions, a hybrid stochastic-binary accumulation approach for vector-matrix multiplication, and the conversion of conventional memory read operations into on-the-fly stochastic multiplication operations with negligible overhead. Thanks to the combination of these approaches, the accuracy analysis of the vector-matrix multiplication benchmark shows that scaling down the stochastic bit-precision from 16-bit to 4-bit achieves nearly the same average error (less than 3%). The derived analytical model of the proposed in-memory stochastic computing architecture demonstrates that the 4-bit stochastic architecture achieves the highest throughput per sub-array (122 Ops/Cycle), which is better than the 16-bit stochastic precision by 4.36x, while still maintaining a small average error of 2.25%.
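The following Python sketch illustrates the core stochastic-computing idea behind this abstract, not the paper's actual architecture (stream lengths and the seed are illustrative): a value in [0, 1] is encoded as a random bitstream, multiplication becomes a bitwise AND of streams, and shorter streams (lower stochastic bit-precision) trade accuracy for throughput.

```python
# Minimal model of unipolar stochastic multiplication via bitstream AND.
import random

def to_stream(p: float, length: int, rng: random.Random) -> list:
    """Unipolar stochastic number: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stochastic_multiply(p: float, q: float, bits: int, rng: random.Random) -> float:
    """AND the two streams, then decode by counting 1s."""
    length = 1 << bits  # stream length 2^bits sets the stochastic precision
    sa, sb = to_stream(p, length, rng), to_stream(q, length, rng)
    return sum(x & y for x, y in zip(sa, sb)) / length

rng = random.Random(42)
exact = 0.75 * 0.5
for bits in (4, 8, 16):
    approx = stochastic_multiply(0.75, 0.5, bits, rng)
    print(f"{bits:2d}-bit stream: {approx:.4f} (exact {exact:.4f})")
```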
... PIM techniques come in various types and technologies [3][4][5][6][7][8][9][10], all operate on data at (or close to) where it is stored, i.e., the memory. In this paper, we are interested in accelerating analytical processing of relational databases with a specific PIM technique called bulk-bitwise PIM [3][4][5][6][7]. ...
Preprint
Full-text available
Online Analytical Processing (OLAP) for relational databases is a business decision support application. The application receives queries about the business database, usually requesting to summarize many database records, and produces few results. Existing OLAP requires transferring a large amount of data between the memory and the CPU, performing only a few operations per datum, and producing a small output. Hence, OLAP is a good candidate for processing-in-memory (PIM), where computation is performed where the data is stored, thus accelerating applications by reducing data movement between the memory and CPU. In particular, bulk-bitwise PIM, where the memory array is a bit-vector processing unit, seems a good match for OLAP. With the extensive inherent parallelism and minimal data movement of bulk-bitwise PIM, OLAP applications can process the entire database in parallel in memory, transferring only the results to the CPU. This paper shows a full-stack adaptation of bulk-bitwise PIM, from compiling SQL to hardware implementation, for supporting OLAP applications. Evaluating the Star Schema Benchmark (SSB), bulk-bitwise PIM achieves a 4.65X speedup over MonetDB, a standard database system.
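To see why OLAP maps well onto bulk-bitwise PIM, consider this toy Python illustration (not the paper's implementation; the columns and predicate are made up): predicates become bitmaps, conjunctions become in-array bitwise ANDs, and only a tiny aggregate crosses the memory channel.

```python
# Toy model: a WHERE clause over two columns becomes one bulk bitwise AND.

region = [1, 2, 1, 3, 1, 2, 1, 1]          # column: customer region id
year   = [97, 98, 97, 97, 99, 97, 97, 98]  # column: order year

# Predicate bitmaps (in bulk-bitwise PIM these live inside the memory array).
bm_region = sum(1 << i for i, r in enumerate(region) if r == 1)
bm_year   = sum(1 << i for i, y in enumerate(year) if y == 97)

# "WHERE region = 1 AND year = 97" is a single bulk bitwise AND in memory ...
matches = bm_region & bm_year

# ... and only the small aggregate leaves the memory.
print("qualifying rows:", bin(matches).count("1"))  # 3
```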
... Row-copy [2], [11] is an out-of-specification in-memory operation that copies the value of one row to another row within the same subarray using charge-sharing through the BL. First, a source row is activated. ...
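A behavioral Python sketch of the two-step row-copy this snippet describes (a simplification; the out-of-spec timing manipulation that enables the second activation is omitted, and the class and method names are illustrative): activating the source row makes the sense amplifiers latch its data, and activating the destination row before precharge makes the same sense amplifiers drive that data into the destination cells.

```python
# Behavioral model of charge-sharing row-copy within a subarray.

class Subarray:
    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]
        self.sense_amps = [0] * cols  # the row buffer

    def activate(self, row: int) -> None:
        """Charge sharing: sense amps capture (and restore) the row's data."""
        self.sense_amps = self.cells[row][:]

    def activate_for_copy(self, row: int) -> None:
        """Second activation without precharge: sense amps overwrite the row."""
        self.cells[row] = self.sense_amps[:]

sa = Subarray(rows=4, cols=8)
sa.cells[1] = [1, 0, 1, 1, 0, 0, 1, 0]  # source data
sa.activate(1)                          # step 1: activate the source row
sa.activate_for_copy(3)                 # step 2: activate the destination row
assert sa.cells[3] == sa.cells[1]
```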
Preprint
The demand for accurate information about the internal structure and characteristics of dynamic random-access memory (DRAM) has been on the rise. Recent studies have explored the structure and characteristics of DRAM to improve processing in memory, enhance reliability, and mitigate a vulnerability known as rowhammer. However, DRAM manufacturers only disclose limited information through official documents, making it difficult to find specific information about actual DRAM devices. This paper presents reliable findings on the internal structure and characteristics of DRAM using activate-induced bitflips (AIBs), retention time tests, and row-copy operations. While previous studies have attempted to understand the internal behaviors of DRAM devices, they have only shown results without identifying the causes or have analyzed DRAM modules rather than individual chips. We first uncover the size, structure, and operation of DRAM subarrays and verify our findings on the characteristics of DRAM. Then, we correct misunderstood information related to AIBs and demonstrate experimental results supporting the cause of rowhammer. We expect that the information we uncover about the structure, behavior, and characteristics of DRAM will help future DRAM research.
... (e.g., 16) DRAM banks. A DRAM bank (❺) consists of hundreds of subarrays [93][94][95]. A DRAM subarray (❻) has many DRAM cells laid out in a two-dimensional array of rows and columns, and a row buffer. ...
... PuM approaches have been demonstrated in DRAM (e.g., [3,[5][6][7][8][9][10]), NVM (e.g., [11][12][13]), NAND flash (e.g., [14,15]) and SRAM (e.g., [16,17]) devices. For example, recent works [5,18,19] show that data copy and initialization can be performed inside a DRAM chip by exploiting internal connectivity in the DRAM chip, even in existing real DRAM chips that do not explicitly support these operations. Latency of a 4KB data copy can be improved by more than 11X and energy by 77X compared to a state-of-the-art processor-centric solution. ...
Preprint
Full-text available
Memory-centric computing aims to enable computation capability in and near all places where data is generated and stored. As such, it can greatly reduce the large negative performance and energy impact of data access and data movement, by fundamentally avoiding data movement and reducing data access latency & energy. Many recent studies show that memory-centric computing can greatly improve system performance and energy efficiency. Major industrial vendors and startup companies have also recently introduced memory chips that have sophisticated computation capabilities. This talk describes promising ongoing research and development efforts in memory-centric computing. We classify such efforts into two major fundamental categories: 1) processing using memory, which exploits analog operational properties of memory structures to perform massively-parallel operations in memory, and 2) processing near memory, which integrates processing capability in memory controllers, the logic layer of 3D-stacked memory technologies, or memory chips to enable high-bandwidth and low-latency memory access to near-memory logic. We show that both types of architectures (and their combination) can enable orders of magnitude improvements in performance and energy consumption of many important workloads, such as graph analytics, databases, machine learning, video processing, climate modeling, and genome analysis. We discuss adoption challenges for the memory-centric computing paradigm and conclude with some research & development opportunities.