Figure 6 - uploaded by Gennady Pekhimenko
Physical memory layout with the LCP framework. (C is the cache line size, e.g., 64B; n = V/C is the number of cache lines per virtual page.)

Source publication
Conference Paper
Full-text available
Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed mem...

Context in source publication

Context 1
... organization simplifies the computation required to determine the main memory address for the compressed slot corresponding to a cache line. More specifically, the address of the compressed slot for the i-th cache line can be computed as p-base + m-size * c-base + (i − 1) * C*, where the first two terms correspond to the start of the LCP (m-size equals the minimum page size, 512B in our implementation) and the third indicates the offset within the LCP of the i-th compressed slot (see Figure 6). Thus, computing the main memory address of a compressed cache line requires one multiplication (which can be implemented as a shift) and two additions, independent of i (fixed latency). ...
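To make the fixed-latency claim concrete, here is a minimal C sketch of the address computation quoted above; the function and variable names, and the way the quantities are passed in, are illustrative assumptions rather than code from the paper.

#include <stdint.h>

/* Minimal sketch of the LCP compressed-slot address computation described in
 * the excerpt: p_base and c_base locate the start of the LCP (in units of the
 * 512B minimum page size), and cstar is the compressed cache line size C*.
 * Names and example values are assumptions for illustration only. */
static inline uint64_t lcp_compressed_slot_addr(uint64_t p_base,
                                                uint64_t c_base,
                                                uint64_t cstar,
                                                uint64_t i) /* 1-based cache line index */
{
    const uint64_t m_size = 512;                    /* minimum page size (512B)          */
    uint64_t lcp_start = p_base + m_size * c_base;  /* start of the LCP in main memory   */
    return lcp_start + (i - 1) * cstar;             /* offset of the i-th compressed slot */
}

Since m_size and typical choices of C* are powers of two, the multiplications reduce to shifts, consistent with the fixed-latency characterization in the excerpt.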

Similar publications

Article
Full-text available
Data compression is a promising technique to address the increasing main memory capacity demand in future systems. Unfortunately, directly applying previously proposed compression algorithms to main memory requires the memory controller to perform non-trivial computations to locate a cache line within the compressed main memory. These additional co...
Article
Full-text available
This paper presents a Web System for the simulation of a Multilevel Cache Memory System for teaching purposes. There are three elements in the whole system: the reference simulator, the web system portal, and the control database. The simulator shows the number of hits and misses in the accesses to the cache memory and the cache contents when th...
Conference Paper
Full-text available
Directories for cache coherence have been recently shown to be vulnerable to conflict-based side-channel attacks. By forcing directory conflicts, an attacker can evict victim directory entries, which in turn trigger the eviction of victim cache lines from private caches. This evidence strongly suggests that directories need to be redesigned for sec...
Preprint
Full-text available
Large gains in the rate of cache-aided broadcast communication are obtained using coded caching, but to obtain these gains, most existing centralized coded caching schemes require that the files at the server be divisible into a large number of parts (this number is called subpacketization). In fact, most schemes require the subpacketization to be growing...
Conference Paper
Full-text available
Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but they will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In thi...

Citations

... Format and architecture co-optimization includes [37,43,90] on CMOS platforms, but these works do not target emerging PIM architectures or scientific computing. Data compression has been explored on DRAM systems [62,77,78]. ...
Conference Paper
Resistive random access memory (ReRAM) is a promising technology that can perform low-cost and in-situ matrix-vector multiplication (MVM) in the analog domain. Scientific computing requires high-precision floating-point (FP) processing. However, performing floating-point computation in ReRAM is challenging because of the high hardware cost and execution time caused by the large FP value range. In this work we present ReFloat, a data format and an accelerator architecture for low-cost and high-performance floating-point processing in ReRAM for iterative linear solvers. ReFloat matches the ReRAM crossbar hardware and represents a block of FP values with reduced bits and an optimized exponent base for a high dynamic representation range. Thus, ReFloat achieves lower ReRAM crossbar consumption and fewer processing cycles, and overcomes the non-convergence issue in a prior work. The evaluation on the SuiteSparse matrices shows that ReFloat achieves 5.02× to 84.28× improvement in terms of solver time compared to a state-of-the-art ReRAM-based accelerator.
... Software memory compression schemes [7,10,11] are often bottlenecked by the computation overhead. Other memory compression schemes [9,12-18] are performed by a dedicated processor between the CPU cache and physical memory, which is called hardware memory compression. ...
Article
Full-text available
In recent decades, memory-intensive applications have experienced a boom, e.g., machine learning, natural language processing (NLP), and big data analytics. Such applications often experience out-of-memory (OOM) errors, which cause processes to exit unexpectedly and without warning, resulting in negative effects on a system’s performance and stability. To mitigate OOM errors, many operating systems implement memory compression (e.g., Linux’s ZRAM) to provide flexible and larger memory space. However, these schemes incur two problems: (1) high-compression-ratio algorithms consume significant CPU resources, which inevitably degrades application performance; and (2) compromised compression algorithms with low latency and low compression ratios result in insignificant increases in memory space. In this paper, we propose QZRAM, which achieves a high-compression-ratio algorithm without high computing consumption through the integration of QAT (an ASIC accelerator) into ZRAM. To enhance hardware and software collaboration, a page-based parallel write module is introduced to serve as a more efficient request processing flow. More importantly, a QAT offloading module is introduced to asynchronously offload compression to the QAT accelerator, reducing CPU computing resource consumption and addressing two challenges: long CPU idle time and low usage of the QAT unit. The comprehensive evaluation validates that QZRAM can reduce CPU usage by up to 49.2% for the FIO micro-benchmark, increase memory space (1.66×) compared to ZRAM, and alleviate the memory overflow phenomenon of the Redis benchmark.
... Due to its relatively low cost and low latency, Dynamic Random Access Memory (DRAM) [188] is the predominant data storage technology used to build main memory. The growing data working set sizes of modern applications [1-9, 189-193] impose an ever-increasing demand for higher DRAM capacity and performance. Unfortunately, DRAM technology scaling is becoming increasingly challenging: it is increasingly difficult to enlarge DRAM chip capacity at low cost while also maintaining DRAM performance, energy efficiency, and reliability [1, 2, 30, 34, 53, 55, 56, 194-198]. Thus, fulfilling the increasing memory needs of modern workloads is becoming increasingly costly and difficult [2-4, 16, 27-29, 31-33, 35-50, 52, 54-57, 59, 62, 63, 100, 136, 199-204]. ...
Chapter
Full-text available
Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) processing near memory by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM. Keywords: Memory systems, Data movement, Main memory, Processing-in-memory, Near-data processing, Computation-in-memory, Processing using memory, Processing near memory, 3D-stacked memory, Non-volatile memory, Energy efficiency, High-performance computing, Computer architecture, Computing paradigm, Emerging technologies, Memory scaling, Technology scaling, Dependable systems, Robust systems, Hardware security, System security, Latency, Low-latency computing.
... To increase effective memory capacity without increasing actual DRAM cost, many prior works [4]-[10] have explored hardware memory compression. Hardware transparently compresses DRAM content on the fly as the memory controller evicts/writes back memory blocks to DRAM. ...
... Hardware memory compression: Many prior works [4]-[10], [12], [13] have explored hardware memory compression; the memory controller (MC) transparently compresses content on the fly when evicting/writing back memory blocks to DRAM and transparently decompresses DRAM content on the fly for every LLC miss. ...
... A prior work - LCP [5] - relies on OS support to embed some CTE information into PTEs; the OS manages new compressed pages of different sizes (e.g., 2KB, 1KB) and records the compressed page size in PTEs. LCP uses the embedded compressed size of a page to predict the DRAM locations of the page's data blocks, speculatively accessing data in DRAM in parallel with accessing the CTE in DRAM. ...
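As a rough illustration of the speculation this excerpt describes, the C sketch below separates the two ingredients the memory controller needs: a block address predicted only from the compressed page size stored in the PTE, and a check against the address obtained once the compression metadata (CTE) arrives. The constant, function names, and equal-slot simplification are assumptions for illustration; the sketch ignores LCP's metadata and exception regions.

#include <stdint.h>
#include <stdbool.h>

#define BLOCKS_PER_PAGE 64u  /* assumption: 4KB virtual page with 64B blocks */

/* Predict a block's DRAM address using only the compressed page size recorded
 * in the PTE (e.g., 1KB or 2KB), before any compression metadata is fetched.
 * Simplification: every block is assumed to occupy an equal slice of the page. */
static uint64_t predict_block_addr(uint64_t compressed_page_base,
                                   uint32_t compressed_page_size,
                                   uint32_t block_index)
{
    uint32_t slot_size = compressed_page_size / BLOCKS_PER_PAGE;
    return compressed_page_base + (uint64_t)block_index * slot_size;
}

/* The speculative DRAM access issued at the predicted address is consumed only
 * if the address later derived from the CTE matches; otherwise the controller
 * reissues the access to the correct location. */
static bool speculation_hit(uint64_t predicted_addr, uint64_t cte_addr)
{
    return predicted_addr == cte_addr;
}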
... The efficiency of metadata embedding is highly related to the ratio of blocks that can be compressed to a specific size. Fortunately, most blocks can be compressed to the target size because on-chip cache data has a high degree of redundancy, as also demonstrated in numerous previous works [9-17], and the compression ratio required for metadata embedding is low. Figure 7 shows the percentage of blocks (64 bytes) that can be compressed to less than 61 bytes for memory-intensive SPEC CPU2006 benchmarks. ...
... STT-RAM and SRAM parameters are obtained using NVSim [1]. Since ADAM employs low-latency compression techniques (i.e., BDI and FPC) specifically designed for on-chip caches, we assume that decompression takes a single clock cycle, as done in many prior studies [9-11, 15-17]. For evaluations, we chose memory-intensive benchmarks, which have greater than 1 MPKI (LLC misses per kilo-instructions), from SPEC CPU2006. ...
... Several prior works have proposed low-cost compression techniques [10, 11, 38]. As these compression techniques have low decompression latency and low implementation cost, they have been used to improve the effective capacity, energy efficiency, and bandwidth of memory systems [9, 12-17]. ADAM employs BDI [10] and FPC [11] as its compression techniques and borrows the idea of metadata embedding from [9]. ...
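Because several of these excerpts lean on BDI/FPC-style compressibility, the following C sketch gives a simplified, hedged base+delta check in the spirit of BDI: one base word, one delta width, no zero/immediate handling, and none of the multiple base/delta configurations that real BDI tries. It is only meant to illustrate why highly redundant cache blocks compress below small targets (e.g., under 61 of 64 bytes), as discussed around Figure 7 above, not to reproduce the encoder used by ADAM or the cited works.

#include <stdint.h>
#include <stdbool.h>

/* Simplified base+delta check: treat a 64B block as eight 8-byte words, take
 * the first word as the base, and test whether every other word is within a
 * signed delta that fits in delta_bytes (1, 2, or 4). If so, the block fits in
 * roughly 8 (base) + 7 * delta_bytes bytes plus a little metadata. */
static bool base_delta_compressible(const uint64_t block[8], unsigned delta_bytes)
{
    const int64_t max_delta = ((int64_t)1 << (8 * delta_bytes - 1)) - 1;
    const int64_t min_delta = -max_delta - 1;

    for (int i = 1; i < 8; i++) {
        /* two's-complement wrap-around yields the signed difference from the base */
        int64_t delta = (int64_t)(block[i] - block[0]);
        if (delta < min_delta || delta > max_delta)
            return false;
    }
    return true;
}

A block of nearby pointers or small counters passes this check with 1- or 2-byte deltas, which is the kind of redundancy these excerpts rely on when freeing a few bytes per block for embedded metadata.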
Article
Full-text available
STT-RAM (Spin-Transfer Torque Random Access Memory) appears to be a viable alternative to SRAM-based on-chip caches. Due to its high density and low leakage power, STT-RAM can be used to build massive capacity last-level caches (LLC). Unfortunately, STT-RAM has a much longer write latency and a much greater write energy than SRAM. Researchers developed hybrid caches made up of SRAM and STT-RAM regions to cope with these challenges. In order to store as many write-intensive blocks in the SRAM region as possible in hybrid caches, an intelligent block placement policy is essential. This paper proposes an adaptive block placement framework for hybrid caches that incorporates metadata embedding (ADAM). When a cache block is evicted from the LLC, ADAM embeds metadata (i.e., write intensity) into the block. Metadata embedded in the cache block are then extracted and used to determine the block’s write intensity when it is fetched from main memory. Our research demonstrates that ADAM can enhance performance by 26% (on average) when compared to a baseline block placement scheme.
... The main drawback of these techniques is that the decompression latency is added to the DRAM latency. Furthermore, caching and virtual-to-physical page mapping need an extra layer of indirection, with further added complexity and memory latency [39], [55]. ...
... Compression is an effective technique to save storage space [25]-[28]. Two pattern-based compression techniques, namely frequent pattern compression (FPC) [29] and base-delta-immediate compression (BDI) [30], are commonly used due to their high compressibility, low overhead, low latency, and low complexity. ...
Article
Full-text available
Resistive Random Access Memory (ReRAM) is promising to be employed as high-density storage-class memory due to its crossbar array and Triple-Level Cell (TLC) structures. However, TLC crossbar ReRAM suffers from high write latency and energy due to three unique challenges: (1) the crossbar array structure incurs IR drop issues; (2) the TLC structure requires an iterative program-and-verify (P&V) procedure; (3) the resistance drift problem needs short-interval scrubbing to avoid uncorrectable soft errors. In this article, to overcome the challenges of TLC crossbar ReRAM, we propose an enhanced low-latency and energy-efficient TLC crossbar ReRAM architecture, called EnTiered-ReRAM. The proposed EnTiered-ReRAM is composed of four components, including EnTiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), Compression-based Flip Scheme (CFS), and Compression-based Error Correction Code (CECC). Specifically, based on the observation that our previously proposed Tiered-crossbar design still suffers from large IR drops along bitlines in the far segments due to the long length of bitlines, EnTiered-crossbar partitions each crossbar array into two halves along bitlines, and then splits each bitline of the half crossbar array into the near and far segments by an isolation transistor, which thoroughly mitigates the IR drop issues. We then use our previously proposed CIDM and CFS in the near and far segments of EnTiered-crossbar arrays to further decrease the write latency and energy. In addition, CECC is deeply coupled with CIDM and CFS. CECC dynamically employs the most appropriate ECC capability according to the remaining space of each cache line after CIDM or CFS encoding, which effectively improves the scrub interval and performance/energy with insignificant space overhead. The evaluation results show that, compared with an aggressive baseline, EnTiered-ReRAM can improve system performance by 56.3% and reduce energy consumption by 60.6% on average.
... Pages are compressed before being moved from the working region in main memory to ZRAM (i.e., swapped in), and decompressed before being moved from ZRAM to the working region in main memory (i.e., swapped out). By using compression in ZRAM, the system can increase the capacity of the swap space by the compression ratio (usually 3:1 [158-160]). ...
... The device is equipped with a 7th-generation Intel Core i3-7100U processor [189], 8 GB DDR4 memory [190], and a 32 GB NAND-flash-based SSD [191]. ChromeOS uses up to 50% of the DRAM capacity (i.e., 4 GB) to enable an in-DRAM compressed swap space called ZRAM, capable of holding up to 12 GB of compressed data (assuming a 3:1 compression ratio [158-160]). We modify the system by (1) removing the in-DRAM compressed swap space and including an Intel Optane SSD module, which the system uses as the swap device for DRAM; and (2) reducing the DRAM size to 4 GB, to hold the non-swap-space DRAM capacity constant. ...
... We compare the performance and energy consumption of interactive workloads running on our Chromebook using three system configurations, as Table 1 describes:
• Baseline: a baseline system with 8 GB of DRAM. 4 GB are used as main memory, which is uncompressed, and the other 4 GB are used as an in-DRAM compressed swap space (ZRAM), which can house up to 12 GB of actual data, assuming a 3:1 compression ratio [158-160];
• Optane: a system with 4 GB of main memory, and 16 GB of Intel Optane SSD swap space;
• NANDFlash: a system with 4 GB of main memory, and 16 GB of NAND-flash-based SSD swap space. ...
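The 12 GB figure that recurs in these excerpts is just the 4 GB ZRAM region scaled by the assumed 3:1 compression ratio; the tiny C check below restates that arithmetic using only the values already quoted above.

#include <assert.h>

int main(void)
{
    const unsigned zram_dram_gb      = 4;  /* DRAM reserved for the compressed swap space */
    const unsigned compression_ratio = 3;  /* assumed 3:1 ratio from the excerpts */
    const unsigned effective_swap_gb = zram_dram_gb * compression_ratio;

    assert(effective_swap_gb == 12);       /* matches the 12 GB of compressed data */
    return 0;
}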
Preprint
Full-text available
The number and diversity of consumer devices are growing rapidly, alongside their target applications' memory consumption. Unfortunately, DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity of consumer devices by augmenting or replacing DRAM. Since entirely replacing DRAM with NVM in consumer devices imposes large system integration and design challenges, recent works propose extending the total main memory space available to applications by using NVM as swap space for DRAM. However, no prior work analyzes the implications of enabling a real NVM-based swap space in real consumer devices. In this work, we provide the first analysis of the impact of extending the main memory space of consumer devices using off-the-shelf NVMs. We extensively examine system performance and energy consumption when the NVM device is used as swap space for DRAM main memory to effectively extend the main memory capacity. For our analyses, we equip real web-based Chromebook computers with the Intel Optane SSD, which is a state-of-the-art low-latency NVM-based SSD device. We compare the performance and energy consumption of interactive workloads running on our Chromebook with NVM-based swap space, where the Intel Optane SSD capacity is used as swap space to extend main memory capacity, against two state-of-the-art systems: (i) a baseline system with double the amount of DRAM than the system with the NVM-based swap space; and (ii) a system where the Intel Optane SSD is naively replaced with a state-of-the-art (yet slower) off-the-shelf NAND-flash-based SSD, which we use as a swap space of equivalent size as the NVM-based swap space.
... This includes moving into distributed processing [102,103] and incorporating high-performance techniques [95,104,205] such as Remote Direct Memory Access [30,33,98,99,185,187], combined with using general high-performance networks that work well with communication-intensive workloads [25,31,38,85]. We are also working on variants of graph mining algorithms in GMS that harness the capabilities of the underlying hardware, such as low-diameter on-chip networks [29,105,160], NUMA awareness and data locality [187,203], near- and in-memory processing [7,8,52,112,114,134,135,163,164,170,188-190], various architecture-related compression techniques [171,172], and others [81,128]. One could incorporate various forms of recently proposed lossy graph compression and summarization [34,41,144], and graph neural networks [22,215,216]. ...
Preprint
We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2x), maximal clique listing (by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x), also obtaining better theoretical performance bounds.
... The time taken to compress the datasets using sequential Sequitur ranges from 10 min to over 20 h. Using a parallel or distributed Sequitur with accelerators (e.g., as in [14,70]) can potentially shorten the compression time substantially. Note that this article focuses on how to use the compressed data to support various analytics tasks such as word count; we do not further discuss the compression process. ...
Article
Full-text available
This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realizations. Additionally, a series of guidelines and technical solutions that effectively address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs, are presented. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.