Figure 5: Pipelined Serial Mode of RowClone

Source publication
Conference Paper
Full-text available
Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy--degrading both system performance and energy efficiency.

Context in source publication

Context 1
... Unlike READ/WRITE, which interact with the memory channel connecting the processor and main memory, TRANSFER does not transfer data outside the chip. Figure 5 pictorially compares the operation of the TRANSFER command with that of READ and WRITE. The dashed lines indicate the data flow corresponding to the three commands. ...
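To make the contrast in Figure 5 concrete, here is a minimal Python sketch (a conceptual model, not from the paper; the function names and 8-byte channel width are illustrative) that counts off-chip channel beats for a processor-mediated copy versus an in-chip TRANSFER:

```python
# Conceptual model of the data flow in Figure 5 (not cycle-accurate).
# READ/WRITE move every byte over the off-chip memory channel twice
# (memory -> processor, then processor -> memory), while TRANSFER keeps
# the data on the internal bus, so only the command crosses the channel.

def copy_with_read_write(num_bytes: int, bus_width: int = 8) -> int:
    """Off-chip channel beats for a processor-mediated copy."""
    beats_per_direction = num_bytes // bus_width
    return 2 * beats_per_direction  # read the data out, then write it back

def copy_with_transfer(num_bytes: int) -> int:
    """Off-chip channel beats when the copy uses the internal bus."""
    return 0  # data never leaves the chip

if __name__ == "__main__":
    page = 4096  # a 4KB copy, as in the paper's evaluation
    print("READ/WRITE beats:", copy_with_read_write(page))  # 1024
    print("TRANSFER beats:  ", copy_with_transfer(page))    # 0
```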

Similar publications

Article
Full-text available
The growing gap between microprocessor speed and DRAM speed is a major problem that computer designers are facing. In order to narrow the gap, it is necessary to improve DRAM's speed and throughput. Moreover, on multi-core platforms, DRAM memory shared by all cores usually suffers from the memory contention and interference problem, which can cause...
Conference Paper
Full-text available
Application-specific accelerators may provide considerable speedup in single-core systems with a runtime-reconfigurable fabric (for simplicity called "fabric" in the following). A reconfigurable core, i.e. processor core pipeline coupled to a fabric, can be integrated along with regular general purpose processor cores (GPPs) into a reconfigurable m...

Citations

... A number of studies have investigated rapid DRAM initialization techniques. Seshadri et al. proposed RowClone, a technique for initializing contiguous DRAM sections with zeros [21]. Seol et al. proposed RowReset, a hardware-efficient memory initialization solution that manipulates the VDD and VSS supplies to DRAM banks [22]. ...
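As a rough illustration of RowClone-style zeroing, the following Python sketch models a subarray with a reserved all-zeros row that is row-copied into each destination row; the reserved-row idea follows the RowClone paper, but the model itself is a simplification:

```python
# Simplified functional model (not the paper's implementation): zeroing a
# subarray by row-copying a reserved all-zeros row into every other row.

ROWS, COLS = 8, 16
subarray = [[1] * COLS for _ in range(ROWS)]  # start with stale data
ZERO_ROW = 0                                  # reserved row that always holds zeros
subarray[ZERO_ROW] = [0] * COLS

def row_copy(src: int, dst: int) -> None:
    """Model of an in-DRAM copy: sense amps latch src, then drive dst."""
    subarray[dst] = subarray[src][:]

for row in range(1, ROWS):                    # bulk-zero the rest of the subarray
    row_copy(ZERO_ROW, row)

assert all(bit == 0 for row in subarray for bit in row)
```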
Preprint
Full-text available
FPGA-based hardware accelerators are becoming increasingly popular due to their versatility, customizability, energy efficiency, constant latency, and scalability. FPGAs can be tailored to specific algorithms, enabling efficient hardware implementations that effectively leverage algorithm parallelism. This can lead to significant performance improvements over CPUs and GPUs, particularly for highly parallel applications. For example, a recent study found that Stratix 10 FPGAs can achieve up to 90% of the performance of a TitanX Pascal GPU while consuming less than 50% of the power. This makes FPGAs an attractive choice for accelerating machine learning (ML) workloads. However, our research finds privacy and security vulnerabilities in existing Xilinx FPGA-based hardware acceleration solutions. These vulnerabilities arise from the lack of memory initialization and insufficient process isolation, which creates potential avenues for unauthorized access to private data used by processes. To illustrate this issue, we conducted experiments using a Xilinx ZCU104 board running the PetaLinux tool from Xilinx. We found that PetaLinux does not effectively clear memory locations associated with a terminated process, leaving them vulnerable to a memory scraping attack (MSA). This paper makes two main contributions. The first contribution is an attack methodology that uses the Xilinx debugger from a different user space. We find that we are able to access process IDs, virtual address spaces, and pagemaps of one user from a different user space because of a lack of adequate process isolation. The second contribution is a methodology for characterizing terminated processes and accessing their private data. We illustrate this on the Xilinx ML application library.
... Keeping this ever-increasing data in a secured backup is a challenging task in terms of performance, energy, and memory. While intelligent and efficient algorithms were proposed for bulk data movement in data centers using row-level cloning [16], integrity verification of the copy procedure is also extremely important. Moreover, in the age of cybersecurity and identity theft, data encryption is equally crucial. ...
Article
Full-text available
Big data applications are on the rise, and so is the number of data centers. The ever-increasing massive data pool needs to be periodically backed up in a secure environment. Moreover, a massive amount of securely backed-up data is required for training binary convolutional neural networks for image classification. XOR and XNOR operations are essential for large-scale data copy verification, encryption, and classification algorithms. The disproportionate speed of existing compute and memory units makes the von Neumann architecture inefficient for these Boolean operations. Compute-in-memory (CiM) has proved to be an optimal approach for such bulk computations. Existing CiM-based XOR/XNOR techniques either require multiple compute cycles or add complexity to the fabrication process. Here, we propose a CMOS-based hardware topology for single-cycle in-memory XOR/XNOR operations. Our design provides at least a 2× latency improvement over other existing CMOS-compatible solutions. We verify the proposed system through circuit/system-level simulations and evaluate its robustness using a 5000-point Monte Carlo variation analysis. This all-CMOS design paves the way for practical implementation of CiM XOR/XNOR at scaled technology nodes.
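The cited contribution is a CMOS circuit topology that a software model cannot reproduce, but a behavioral Python sketch can show the use case the abstract highlights: verifying a bulk copy with in-memory XOR, where an all-zero result means the copy is intact.

```python
# Behavioral sketch only; rows are modeled as Python ints (bit-vectors).
# This illustrates the copy-verification use case named in the abstract,
# not the proposed single-cycle circuit itself.

def in_memory_xor(row_a: int, row_b: int) -> int:
    """Bitwise XOR of two rows (single-cycle in the proposed hardware)."""
    return row_a ^ row_b

def copy_is_intact(src_row: int, dst_row: int) -> bool:
    """A copy is verified iff XOR of source and destination is all zeros."""
    return in_memory_xor(src_row, dst_row) == 0

source = 0b1011_0110_0101_0011
good_copy = source
bad_copy = source ^ (1 << 7)  # inject a single-bit error

assert copy_is_intact(source, good_copy)
assert not copy_is_intact(source, bad_copy)
```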
... To solve this issue, three rows in the array can be specially designed for logic operations. Additionally, before and after the logic operations, the RowClone operation [63], which copies data in a source row to a destination row by using the same charge-sharing principle, should be performed to transfer the inputs and outputs within the array. ...
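The "three rows designed for logic" likely refers to Ambit-style triple-row activation. A functional Python sketch of that idea (the model is a simplification, not the cited design): activating three rows at once charge-shares each column, and the sense amplifier resolves to the majority of the three cells, from which AND and OR follow by fixing one input.

```python
# Functional sketch of triple-row activation: the sense amplifier resolves
# each column to MAJ(a, b, c), so MAJ(a, b, 0) = AND(a, b) and
# MAJ(a, b, 1) = OR(a, b).

def triple_row_activate(row_a, row_b, row_c):
    """Per-column majority, as resolved by the sense amplifiers."""
    return [int(a + b + c >= 2) for a, b, c in zip(row_a, row_b, row_c)]

a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
zeros, ones = [0] * 4, [1] * 4

assert triple_row_activate(a, b, zeros) == [1, 0, 0, 0]  # AND(a, b)
assert triple_row_activate(a, b, ones)  == [1, 1, 1, 0]  # OR(a, b)
```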
... Accounting for both memory structure and behavior facilitates deployment of the desired PiM logic at any desired location. Finally, to demonstrate the capabilities of the PiMulator platform, we show strategies to prototype and emulate both pioneering bitwise-PiM architectures such as RowClone [33], LISA [34], and Ambit [35] and generic PiM architectures such as Fulcrum [36] and BLIMP [37]. These prior works demonstrate high-bandwidth, low-latency, and energy-efficient PiM operations at the subarray, bank, and chip level. ...
... • We demonstrate strategies for prototyping and design-space exploration of bitwise-PiM and more general/complex PiM architectures. We conduct PiM architecture evaluations by exploring various subarray configurations for RowClone [33] and Ambit [35], including inter-subarray links as described in LISA [34]. Next, we showcase emulation and explore various configurations of PiM architectures that together cover a larger PiM space, such as Fulcrum [36], DRISA [39], Sieve [40], and BLIMP [37]. ...
... PiDRAM [14] demonstrates RowClone FPM [33] data copy, Ambit-like [35] AND-OR-NOT operations, and true random number generation using DDR3 memory by violating the memory timings. ...
Thesis
Full-text available
Motivated by the memory wall problem, researchers propose numerous Processing-in-Memory (PiM) architectures to bring computation closer to data. Evaluating the performance of these emerging architectures is challenging due to the lack of tools that accurately mimic both software and hardware aspects. This thesis presents PiMulator, an open-source platform for system-level PiM emulation, suitable for rapid prototyping and evaluation of PiM architectures. At its core, PiMulator incorporates MEMulator, a main-memory emulation model implemented in SystemVerilog, enabling users to generate any desired memory configuration on the FPGA fabric with complete control over the PiM logic units. Furthermore, we develop and implement the FreezeTime mechanism, effectively extending the emulated memory capacity by synchronizing the limited FPGA chip memory resources with the board's DDR4 and HBM2 resources. This approach offers flexibility in modeling memory and logic behavior without compromising emulated time accuracy. The platform integrates the Memory+PiM model into the LiteX framework, ensuring compatibility with a robust FPGA and RISC-V ecosystem. We adopt a system emulation approach that combines CPUs, memory controllers, memories, interconnect, and peripherals using soft cores synthesizable on FPGA boards. This enables architects to prototype, emulate, and evaluate various PiM architectures and designs at the system level. PiMulator facilitates high-speed and high-fidelity modeling and evaluation of emerging memory and PiM architectures with workloads of interest, utilizing soft cores synthesizable on FPGA boards. This thesis demonstrates strategies to model several pioneering PiM architectures and provides detailed benchmark performance results, showcasing the platform's ability to facilitate design space exploration.
... With the proliferation of today's domain-specific architectures, there exists a variety of PIM or near-memory processing studies [14], [16], [18], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [84], [85], [86], [87], [88], [89], [90], [91], [92], [93], [94], [95], [96], [97]. There are also several prior works on PIM exploring compiler issues [8], [10], [98], data coherency [11], [12], [13], [99], synchronization [100], and QoS-aware runtime and scheduling for PIM [101], among many others [102], [103], [104], [105], [106], [107], [108]. This paper focuses on characterizing the first real-world general-purpose PIM via our PIMulator, pathfinding important research directions for future PIMs. ...
Preprint
Full-text available
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deep-dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes into their compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem will be critical for future PIM architectures to support.
... To mitigate the data-movement bottleneck, In-Memory Computing (IMC) architectures were proposed to avoid moving data across chips. The main idea is to process data where it resides, in DRAMs (Seshadri et al., 2013; Farmahini-Farahani et al., 2015) or SRAMs (Jeloka et al., 2015; Eckert et al., 2018; Fujiki et al., 2019; Al-Hawaj et al., 2020). In SRAMs, the bitline computing approach is used to compute bitwise AND and NOR operations on two simultaneously activated wordlines. ...
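A behavioral Python model of SRAM bitline computing as just described (bit widths and names are illustrative): with two wordlines activated at once, the bitline only stays high where both cells store 1, yielding AND, while the complementary bitline yields NOR.

```python
# Behavioral model of SRAM bitline computing on two activated wordlines.
# BL senses AND(a, b); the complementary bitline BLB senses NOR(a, b).

def bitline_and(word_a: int, word_b: int) -> int:
    """Value sensed on BL when both wordlines are activated."""
    return word_a & word_b

def bitline_nor(word_a: int, word_b: int, width: int = 8) -> int:
    """Value sensed on BLB: high only where both cells store 0."""
    mask = (1 << width) - 1
    return ~(word_a | word_b) & mask

a, b = 0b1100_1010, 0b1010_0110
print(f"AND: {bitline_and(a, b):08b}")  # 10000010
print(f"NOR: {bitline_nor(a, b):08b}")  # 00010001
```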
Article
Full-text available
Applications of Artificial Intelligence currently dominate the technology landscape. Meanwhile, conventional von Neumann architectures are struggling with the data-movement bottleneck to meet the ever-increasing performance demands of these data-centric applications. Moreover, the vector-matrix multiplication cost, in the binary domain, is a major computational bottleneck for these applications. This paper introduces a novel digital in-memory stochastic computing architecture that leverages the simplicity of stochastic computing for in-memory vector-matrix multiplication. The proposed architecture incorporates several new approaches, including a new stochastic number generator with ideal binary-to-stochastic mapping, a best-seeding approach for accurate-enough low stochastic bit-precisions, a hybrid stochastic-binary accumulation approach for vector-matrix multiplication, and the conversion of conventional memory read operations into on-the-fly stochastic multiplication operations with negligible overhead. Thanks to the combination of these approaches, the accuracy analysis of the vector-matrix multiplication benchmark shows that scaling down the stochastic bit-precision from 16-bit to 4-bit achieves nearly the same average error (less than 3%). The derived analytical model of the proposed in-memory stochastic computing architecture demonstrates that the 4-bit stochastic architecture achieves the highest throughput per sub-array (122 Ops/Cycle), which is better than the 16-bit stochastic precision by 4.36x, while still maintaining a small average error of 2.25%.
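The following Python sketch illustrates the core stochastic-computing idea behind this abstract, not the paper's actual architecture (stream lengths and the seed are illustrative): a value in [0, 1] is encoded as a random bitstream, multiplication becomes a bitwise AND of streams, and shorter streams (lower stochastic bit-precision) trade accuracy for throughput.

```python
# Minimal model of unipolar stochastic multiplication via bitstream AND.
import random

def to_stream(p: float, length: int, rng: random.Random) -> list:
    """Unipolar stochastic number: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stochastic_multiply(p: float, q: float, bits: int, rng: random.Random) -> float:
    """AND the two streams, then decode by counting 1s."""
    length = 1 << bits  # stream length 2^bits sets the stochastic precision
    sa, sb = to_stream(p, length, rng), to_stream(q, length, rng)
    return sum(x & y for x, y in zip(sa, sb)) / length

rng = random.Random(42)
exact = 0.75 * 0.5
for bits in (4, 8, 16):
    approx = stochastic_multiply(0.75, 0.5, bits, rng)
    print(f"{bits:2d}-bit stream: {approx:.4f} (exact {exact:.4f})")
```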
... PIM techniques come in various types and technologies [3][4][5][6][7][8][9][10], all operate on data at (or close to) where it is stored, i.e., the memory. In this paper, we are interested in accelerating analytical processing of relational databases with a specific PIM technique called bulk-bitwise PIM [3][4][5][6][7]. ...
Preprint
Full-text available
Online Analytical Processing (OLAP) for relational databases is a business decision support application. The application receives queries about the business database, usually requesting to summarize many database records, and produces few results. Existing OLAP requires transferring a large amount of data between the memory and the CPU, performing only a few operations per datum, and producing a small output. Hence, OLAP is a good candidate for processing-in-memory (PIM), where computation is performed where the data is stored, thus accelerating applications by reducing data movement between the memory and CPU. In particular, bulk-bitwise PIM, where the memory array is a bit-vector processing unit, seems a good match for OLAP. With the extensive inherent parallelism and minimal data movement of bulk-bitwise PIM, OLAP applications can process the entire database in parallel in memory, transferring only the results to the CPU. This paper shows a full-stack adaptation of bulk-bitwise PIM, from compiling SQL to hardware implementation, for supporting OLAP applications. Evaluating the Star Schema Benchmark (SSB), bulk-bitwise PIM achieves a 4.65X speedup over MonetDB, a standard database system.
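To see why OLAP maps well onto bulk-bitwise PIM, consider this toy Python illustration (not the paper's implementation; the columns and predicate are made up): predicates become bitmaps, conjunctions become in-array bitwise ANDs, and only a tiny aggregate crosses the memory channel.

```python
# Toy model: a WHERE clause over two columns becomes one bulk bitwise AND.

region = [1, 2, 1, 3, 1, 2, 1, 1]          # column: customer region id
year   = [97, 98, 97, 97, 99, 97, 97, 98]  # column: order year

# Predicate bitmaps (in bulk-bitwise PIM these live inside the memory array).
bm_region = sum(1 << i for i, r in enumerate(region) if r == 1)
bm_year   = sum(1 << i for i, y in enumerate(year) if y == 97)

# "WHERE region = 1 AND year = 97" is a single bulk bitwise AND in memory ...
matches = bm_region & bm_year

# ... and only the small aggregate leaves the memory.
print("qualifying rows:", bin(matches).count("1"))  # 3
```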
... Row-copy [2], [11] is an out-of-specification in-memory operation that copies the value of one row to another row within the same subarray using charge-sharing through the BL. First, a source row is activated. ...
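A behavioral Python sketch of the two-step row-copy this snippet describes (a simplification; the out-of-spec timing manipulation that enables the second activation is omitted, and the class and method names are illustrative): activating the source row makes the sense amplifiers latch its data, and activating the destination row before precharge makes the same sense amplifiers drive that data into the destination cells.

```python
# Behavioral model of charge-sharing row-copy within a subarray.

class Subarray:
    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]
        self.sense_amps = [0] * cols  # the row buffer

    def activate(self, row: int) -> None:
        """Charge sharing: sense amps capture (and restore) the row's data."""
        self.sense_amps = self.cells[row][:]

    def activate_for_copy(self, row: int) -> None:
        """Second activation without precharge: sense amps overwrite the row."""
        self.cells[row] = self.sense_amps[:]

sa = Subarray(rows=4, cols=8)
sa.cells[1] = [1, 0, 1, 1, 0, 0, 1, 0]  # source data
sa.activate(1)                          # step 1: activate the source row
sa.activate_for_copy(3)                 # step 2: activate the destination row
assert sa.cells[3] == sa.cells[1]
```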
Preprint
The demand for accurate information about the internal structure and characteristics of dynamic random-access memory (DRAM) has been on the rise. Recent studies have explored the structure and characteristics of DRAM to improve processing in memory, enhance reliability, and mitigate a vulnerability known as rowhammer. However, DRAM manufacturers only disclose limited information through official documents, making it difficult to find specific information about actual DRAM devices. This paper presents reliable findings on the internal structure and characteristics of DRAM using activate-induced bitflips (AIBs), retention time tests, and row-copy operations. While previous studies have attempted to understand the internal behaviors of DRAM devices, they have only shown results without identifying the causes or have analyzed DRAM modules rather than individual chips. We first uncover the size, structure, and operation of DRAM subarrays and verify our findings on the characteristics of DRAM. Then, we correct misunderstood information related to AIBs and demonstrate experimental results supporting the cause of rowhammer. We expect that the information we uncover about the structure, behavior, and characteristics of DRAM will help future DRAM research.
... (e.g., 16) DRAM banks. A DRAM bank (❺) consists of hundreds of subarrays [93][94][95]. A DRAM subarray (❻) has many DRAM cells laid out in a two-dimensional array of rows and columns, and a row buffer. ...
... PuM approaches have been demonstrated in DRAM (e.g., [3,[5][6][7][8][9][10]), NVM (e.g., [11][12][13]), NAND flash (e.g., [14,15]) and SRAM (e.g., [16,17]) devices. For example, recent works [5,18,19] show that data copy and initialization can be performed inside a DRAM chip by exploiting internal connectivity in the DRAM chip, even in existing real DRAM chips that do not explicitly support these operations. Latency of a 4KB data copy can be improved by more than 11X and energy by 77X compared to a state-of-the-art processor-centric solution. ...
Preprint
Full-text available
Memory-centric computing aims to enable computation capability in and near all places where data is generated and stored. As such, it can greatly reduce the large negative performance and energy impact of data access and data movement, by fundamentally avoiding data movement and reducing data access latency & energy. Many recent studies show that memory-centric computing can greatly improve system performance and energy efficiency. Major industrial vendors and startup companies have also recently introduced memory chips that have sophisticated computation capabilities. This talk describes promising ongoing research and development efforts in memory-centric computing. We classify such efforts into two major fundamental categories: 1) processing using memory, which exploits analog operational properties of memory structures to perform massively-parallel operations in memory, and 2) processing near memory, which integrates processing capability in memory controllers, the logic layer of 3D-stacked memory technologies, or memory chips to enable high-bandwidth and low-latency memory access to near-memory logic. We show that both types of architectures (and their combination) can enable orders of magnitude improvements in performance and energy consumption of many important workloads, such as graph analytics, databases, machine learning, video processing, climate modeling, and genome analysis. We discuss adoption challenges for the memory-centric computing paradigm and conclude with some research & development opportunities.