Figure 6 - uploaded by Gennady Pekhimenko
Physical memory layout with the LCP framework. (C is the cache line size, e.g., 64B; n = V/C is the number of cache lines per virtual page.)

Source publication
Conference Paper
Full-text available
Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed mem...

Context in source publication

Context 1
... organization simplifies the computation required to determine the main memory address for the compressed slot corresponding to a cache line. More specifically, the address of the compressed slot for the i-th cache line can be computed as p-base + m-size * c-base + (i − 1) * C*, where the first two terms correspond to the start of the LCP (m-size equals the minimum page size, 512B in our implementation) and the third indicates the offset within the LCP of the i-th compressed slot (see Figure 6). Thus, computing the main memory address of a compressed cache line requires one multiplication (which can be implemented as a shift) and two additions, independent of i (fixed latency). ...
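To make the fixed-latency claim concrete, here is a minimal C sketch of the address computation quoted above; the function and variable names, and the way the quantities are passed in, are illustrative assumptions rather than code from the paper.

#include <stdint.h>

/* Minimal sketch of the LCP compressed-slot address computation described in
 * the excerpt: p_base and c_base locate the start of the LCP (in units of the
 * 512B minimum page size), and cstar is the compressed cache line size C*.
 * Names and example values are assumptions for illustration only. */
static inline uint64_t lcp_compressed_slot_addr(uint64_t p_base,
                                                uint64_t c_base,
                                                uint64_t cstar,
                                                uint64_t i) /* 1-based cache line index */
{
    const uint64_t m_size = 512;                    /* minimum page size (512B)          */
    uint64_t lcp_start = p_base + m_size * c_base;  /* start of the LCP in main memory   */
    return lcp_start + (i - 1) * cstar;             /* offset of the i-th compressed slot */
}

Since m_size and typical choices of C* are powers of two, the multiplications reduce to shifts, consistent with the fixed-latency characterization in the excerpt.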

Similar publications

Article
Full-text available
Data compression is a promising technique to address the increasing main memory capacity demand in future systems. Unfortunately, directly applying previously proposed compression algorithms to main memory requires the memory controller to perform non-trivial computations to locate a cache line within the compressed main memory. These additional co...
Article
Full-text available
This paper presents a Web System for the simulation of a Multilevel Cache Memory System for teaching purposes. There are three elements in the whole system: the reference simulator, the web system portal, and the control database. The simulator shows the number of hits and misses in the accesses to the cache memory and the cache contents when th...
Conference Paper
Full-text available
Directories for cache coherence have been recently shown to be vulnerable to conflict-based side-channel attacks. By forcing directory conflicts, an attacker can evict victim directory entries, which in turn trigger the eviction of victim cache lines from private caches. This evidence strongly suggests that directories need to be redesigned for sec...
Preprint
Full-text available
Large gains in the rate of cache-aided broadcast communication are obtained using coded caching, but to obtain these gains, most existing centralized coded caching schemes require that the files at the server be divisible into a large number of parts (this number is called subpacketization). In fact, most schemes require the subpacketization to be growing...
Conference Paper
Full-text available
Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but they will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In thi...

Citations

... Format and architecture co-optimization includes [37,43,90] on CMOS platforms, but these works do not target emerging PIM architectures or scientific computing. Data compression has been explored on DRAM systems [62,77,78]. ...
Conference Paper
Resistive random access memory (ReRAM) is a promising technology that can perform low-cost and in-situ matrix-vector multiplication (MVM) in the analog domain. Scientific computing requires high-precision floating-point (FP) processing. However, performing floating-point computation in ReRAM is challenging because of the high hardware cost and execution time caused by the large FP value range. In this work we present ReFloat, a data format and an accelerator architecture for low-cost and high-performance floating-point processing in ReRAM for iterative linear solvers. ReFloat matches the ReRAM crossbar hardware and represents a block of FP values with reduced bits and an optimized exponent base for a high dynamic representation range. Thus, ReFloat achieves lower ReRAM crossbar consumption and fewer processing cycles, and overcomes the non-convergence issue in a prior work. The evaluation on the SuiteSparse matrices shows that ReFloat achieves 5.02× to 84.28× improvement in terms of solver time compared to a state-of-the-art ReRAM-based accelerator.
... Software memory compression schemes [7,10,11] are often bottlenecked by the computation overhead. Other memory compression schemes [9,12-18] are performed by a dedicated processor between the CPU cache and physical memory, which is called hardware memory compression. ...
Article
Full-text available
In recent decades, memory-intensive applications have experienced a boom, e.g., machine learning, natural language processing (NLP), and big data analytics. Such applications often experience out-of-memory (OOM) errors, which cause processes to exit unexpectedly and without warning, resulting in negative effects on a system’s performance and stability. To mitigate OOM errors, many operating systems implement memory compression (e.g., Linux’s ZRAM) to provide flexible and larger memory space. However, these schemes incur two problems: (1) high-compression-ratio algorithms consume significant CPU resources, which inevitably degrades application performance; and (2) compromised compression algorithms with low latency and low compression ratios result in insignificant increases in memory space. In this paper, we propose QZRAM, which achieves a high-compression-ratio algorithm without high computing consumption through the integration of QAT (an ASIC accelerator) into ZRAM. To enhance hardware and software collaboration, a page-based parallel write module is introduced to serve as a more efficient request processing flow. More importantly, a QAT offloading module is introduced to asynchronously offload compression to the QAT accelerator, reducing CPU computing resource consumption and addressing two challenges: long CPU idle time and low usage of the QAT unit. The comprehensive evaluation validates that QZRAM can reduce CPU usage by up to 49.2% for the FIO micro-benchmark, increase memory space (1.66×) compared to ZRAM, and alleviate the memory overflow phenomenon of the Redis benchmark.
... Due to its relatively low cost and low latency, Dynamic Random Access Memory (DRAM) [188] is the predominant data storage technology used to build main memory. The growing data working set sizes of modern applications [1-9, 189-193] impose an ever-increasing demand for higher DRAM capacity and performance. Unfortunately, DRAM technology scaling is becoming increasingly challenging: it is increasingly difficult to enlarge DRAM chip capacity at low cost while also maintaining DRAM performance, energy efficiency, and reliability [1, 2, 30, 34, 53, 55, 56, 194-198]. Thus, fulfilling the increasing memory needs of modern workloads is becoming increasingly costly and difficult [2-4, 16, 27-29, 31-33, 35-50, 52, 54-57, 59, 62, 63, 100, 136, 199-204]. ...
Chapter
Full-text available
Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) processing near memory by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM. Keywords: Memory systems, Data movement, Main memory, Processing-in-memory, Near-data processing, Computation-in-memory, Processing using memory, Processing near memory, 3D-stacked memory, Non-volatile memory, Energy efficiency, High-performance computing, Computer architecture, Computing paradigm, Emerging technologies, Memory scaling, Technology scaling, Dependable systems, Robust systems, Hardware security, System security, Latency, Low-latency computing.
... To increase effective memory capacity without increasing actual DRAM cost, many prior works [4]-[10] have explored hardware memory compression. Hardware transparently compresses DRAM content on the fly as the memory controller evicts/writes back memory blocks to DRAM. ...
... Hardware memory compression: Many prior works [4]-[10], [12], [13] have explored hardware memory compression; the memory controller (MC) transparently compresses content on the fly when evicting/writing back memory blocks to DRAM and transparently decompresses DRAM content on the fly for every LLC miss. ...
... A prior work - LCP [5] - relies on OS support to embed some CTE information into PTEs; the OS manages new compressed pages of different sizes (e.g., 2KB, 1KB) and records the compressed page size in PTEs. LCP uses the embedded compressed size of a page to predict the DRAM locations of the page's data blocks, speculatively accessing data in DRAM in parallel with accessing the CTE in DRAM. ...
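As a rough illustration of the speculation this excerpt describes, the C sketch below separates the two ingredients the memory controller needs: a block address predicted only from the compressed page size stored in the PTE, and a check against the address obtained once the compression metadata (CTE) arrives. The constant, function names, and equal-slot simplification are assumptions for illustration; the sketch ignores LCP's metadata and exception regions.

#include <stdint.h>
#include <stdbool.h>

#define BLOCKS_PER_PAGE 64u  /* assumption: 4KB virtual page with 64B blocks */

/* Predict a block's DRAM address using only the compressed page size recorded
 * in the PTE (e.g., 1KB or 2KB), before any compression metadata is fetched.
 * Simplification: every block is assumed to occupy an equal slice of the page. */
static uint64_t predict_block_addr(uint64_t compressed_page_base,
                                   uint32_t compressed_page_size,
                                   uint32_t block_index)
{
    uint32_t slot_size = compressed_page_size / BLOCKS_PER_PAGE;
    return compressed_page_base + (uint64_t)block_index * slot_size;
}

/* The speculative DRAM access issued at the predicted address is consumed only
 * if the address later derived from the CTE matches; otherwise the controller
 * reissues the access to the correct location. */
static bool speculation_hit(uint64_t predicted_addr, uint64_t cte_addr)
{
    return predicted_addr == cte_addr;
}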
... The efficiency of metadata embedding is highly related to the ratio of blocks that can be compressed to a specific size. Fortunately, most blocks can be compressed to the target size because on-chip cache data has a high degree of redundancy, as also demonstrated in numerous previous works [9-17], and the compression ratio required for metadata embedding is low. Figure 7 shows the percentage of blocks (64 bytes) that can be compressed to less than 61 bytes for memory-intensive SPEC CPU2006 benchmarks. ...
... STT-RAM and SRAM parameters are obtained using NVSim [1]. Since ADAM employs low-latency compression techniques (i.e., BDI and FPC) specifically designed for on-chip caches, we assume that decompression takes a single clock cycle, as done in many prior studies [9-11, 15-17]. For evaluations, we chose memory-intensive benchmarks, which have greater than 1 MPKI (LLC misses per kilo-instructions), from SPEC CPU2006. ...
... Several prior works have proposed low-cost compression techniques [10, 11, 38]. As these compression techniques have low decompression latency and low implementation cost, they have been used to improve the effective capacity, energy efficiency, and bandwidth of memory systems [9, 12-17]. ADAM employs BDI [10] and FPC [11] as its compression techniques and borrows the idea of metadata embedding from [9]. ...
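Because several of these excerpts lean on BDI/FPC-style compressibility, the following C sketch gives a simplified, hedged base+delta check in the spirit of BDI: one base word, one delta width, no zero/immediate handling, and none of the multiple base/delta configurations that real BDI tries. It is only meant to illustrate why highly redundant cache blocks compress below small targets (e.g., under 61 of 64 bytes), as discussed around Figure 7 above, not to reproduce the encoder used by ADAM or the cited works.

#include <stdint.h>
#include <stdbool.h>

/* Simplified base+delta check: treat a 64B block as eight 8-byte words, take
 * the first word as the base, and test whether every other word is within a
 * signed delta that fits in delta_bytes (1, 2, or 4). If so, the block fits in
 * roughly 8 (base) + 7 * delta_bytes bytes plus a little metadata. */
static bool base_delta_compressible(const uint64_t block[8], unsigned delta_bytes)
{
    const int64_t max_delta = ((int64_t)1 << (8 * delta_bytes - 1)) - 1;
    const int64_t min_delta = -max_delta - 1;

    for (int i = 1; i < 8; i++) {
        /* two's-complement wrap-around yields the signed difference from the base */
        int64_t delta = (int64_t)(block[i] - block[0]);
        if (delta < min_delta || delta > max_delta)
            return false;
    }
    return true;
}

A block of nearby pointers or small counters passes this check with 1- or 2-byte deltas, which is the kind of redundancy these excerpts rely on when freeing a few bytes per block for embedded metadata.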
Article
Full-text available
STT-RAM (Spin-Transfer Torque Random Access Memory) appears to be a viable alternative to SRAM-based on-chip caches. Due to its high density and low leakage power, STT-RAM can be used to build massive capacity last-level caches (LLC). Unfortunately, STT-RAM has a much longer write latency and a much greater write energy than SRAM. Researchers developed hybrid caches made up of SRAM and STT-RAM regions to cope with these challenges. In order to store as many write-intensive blocks in the SRAM region as possible in hybrid caches, an intelligent block placement policy is essential. This paper proposes an adaptive block placement framework for hybrid caches that incorporates metadata embedding (ADAM). When a cache block is evicted from the LLC, ADAM embeds metadata (i.e., write intensity) into the block. Metadata embedded in the cache block are then extracted and used to determine the block’s write intensity when it is fetched from main memory. Our research demonstrates that ADAM can enhance performance by 26% (on average) when compared to a baseline block placement scheme.
... The main drawback of these techniques is that the decompression latency is added to the DRAM latency. Furthermore, caching and virtual-to-physical page mapping need an extra layer of indirection, with further added complexity and memory latency [39], [55]. ...
... Compression is an effective technique to save storage space [25]-[28]. Two pattern-based compression techniques, namely frequent pattern compression (FPC) [29] and base-delta-immediate compression (BDI) [30], are commonly used due to their high compressibility, low overhead, low latency, and low complexity. ...
Article
Full-text available
Resistive Random Access Memory (ReRAM) is promising to be employed as high-density storage-class memory due to its crossbar array and Triple-Level Cell (TLC) structures. However, TLC crossbar ReRAM suffers from high write latency and energy due to three unique challenges: (1) the crossbar array structure incurs IR drop issues; (2) the TLC structure requires an iterative program-and-verify (P&V) procedure; (3) the resistance drift problem needs short-interval scrubbing to avoid uncorrectable soft errors. In this article, to overcome the challenges of TLC crossbar ReRAM, we propose an enhanced low-latency and energy-efficient TLC crossbar ReRAM architecture, called EnTiered-ReRAM. The proposed EnTiered-ReRAM is composed of four components, including EnTiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), Compression-based Flip Scheme (CFS), and Compression-based Error Correction Code (CECC). Specifically, based on the observation that our previously proposed Tiered-crossbar design still suffers from large IR drops along bitlines in the far segments due to the long length of bitlines, EnTiered-crossbar partitions each crossbar array into two halves along bitlines, and then splits each bitline of the half crossbar array into the near and far segments by an isolation transistor, which thoroughly mitigates the IR drop issues. We then use our previously proposed CIDM and CFS in the near and far segments of EnTiered-crossbar arrays to further decrease the write latency and energy. In addition, CECC is deeply coupled with CIDM and CFS. CECC dynamically employs the most appropriate ECC capability according to the remaining space of each cache line after CIDM or CFS encoding, which effectively improves the scrub interval and performance/energy with insignificant space overhead. The evaluation results show that, compared with an aggressive baseline, EnTiered-ReRAM can improve system performance by 56.3% and reduce energy consumption by 60.6% on average.
... Pages are compressed before being moved from the working region in main memory to ZRAM (i.e., swapped in), and decompressed before being moved from ZRAM to the working region in main memory (i.e., swapped out). By using compression in ZRAM, the system can increase the capacity of the swap space by the compression ratio (usually 3:1 [158-160]). ...
... The device is equipped with a 7th-generation Intel Core i3-7100U processor [189], 8 GB DDR4 memory [190], and a 32 GB NAND-flash-based SSD [191]. ChromeOS uses up to 50% of the DRAM capacity (i.e., 4 GB) to enable an in-DRAM compressed swap space called ZRAM, capable of holding up to 12 GB of compressed data (assuming a 3:1 compression ratio [158-160]). We modify the system by (1) removing the in-DRAM compressed swap space and including an Intel Optane SSD module, which the system uses as the swap device for DRAM; and (2) reducing the DRAM size to 4 GB, to hold the non-swap-space DRAM capacity constant. ...
... We compare the performance and energy consumption of interactive workloads running on our Chromebook using three system configurations, as Table 1 describes:
• Baseline: a baseline system with 8 GB of DRAM. 4 GB are used as main memory, which is uncompressed, and the other 4 GB are used as an in-DRAM compressed swap space (ZRAM), which can house up to 12 GB of actual data, assuming a 3:1 compression ratio [158-160];
• Optane: a system with 4 GB of main memory, and 16 GB of Intel Optane SSD swap space;
• NANDFlash: a system with 4 GB of main memory, and 16 GB of NAND-flash-based SSD swap space. ...
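The 12 GB figure that recurs in these excerpts is just the 4 GB ZRAM region scaled by the assumed 3:1 compression ratio; the tiny C check below restates that arithmetic using only the values already quoted above.

#include <assert.h>

int main(void)
{
    const unsigned zram_dram_gb      = 4;  /* DRAM reserved for the compressed swap space */
    const unsigned compression_ratio = 3;  /* assumed 3:1 ratio from the excerpts */
    const unsigned effective_swap_gb = zram_dram_gb * compression_ratio;

    assert(effective_swap_gb == 12);       /* matches the 12 GB of compressed data */
    return 0;
}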
Preprint
Full-text available
The number and diversity of consumer devices are growing rapidly, alongside their target applications' memory consumption. Unfortunately, DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity of consumer devices by augmenting or replacing DRAM. Since entirely replacing DRAM with NVM in consumer devices imposes large system integration and design challenges, recent works propose extending the total main memory space available to applications by using NVM as swap space for DRAM. However, no prior work analyzes the implications of enabling a real NVM-based swap space in real consumer devices. In this work, we provide the first analysis of the impact of extending the main memory space of consumer devices using off-the-shelf NVMs. We extensively examine system performance and energy consumption when the NVM device is used as swap space for DRAM main memory to effectively extend the main memory capacity. For our analyses, we equip real web-based Chromebook computers with the Intel Optane SSD, which is a state-of-the-art low-latency NVM-based SSD device. We compare the performance and energy consumption of interactive workloads running on our Chromebook with NVM-based swap space, where the Intel Optane SSD capacity is used as swap space to extend main memory capacity, against two state-of-the-art systems: (i) a baseline system with double the amount of DRAM than the system with the NVM-based swap space; and (ii) a system where the Intel Optane SSD is naively replaced with a state-of-the-art (yet slower) off-the-shelf NAND-flash-based SSD, which we use as a swap space of equivalent size as the NVM-based swap space.
... This includes moving into distributed processing [102,103] and incorporating high-performance techniques [95,104,205] such as Remote Direct Memory Access [30,33,98,99,185,187], combined with using general high-performance networks that work well with communication-intensive workloads [25,31,38,85]. We are also working on variants of graph mining algorithms in GMS that harness the capabilities of the underlying hardware, such as low-diameter on-chip networks [29,105,160], NUMA awareness and data locality [187,203], near- and in-memory processing [7,8,52,112,114,134,135,163,164,170,188-190], various architecture-related compression techniques [171,172], and others [81,128]. One could incorporate various forms of recently proposed lossy graph compression and summarization [34,41,144], and graph neural networks [22,215,216]. ...
Preprint
We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2x), maximal clique listing (by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x), also obtaining better theoretical performance bounds.
... The time taken to compress the datasets using sequential Sequitur ranges from 10 min to over 20 h. Using a parallel or distributed Sequitur with accelerators (e.g., as in [14,70]) can potentially shorten the compression time substantially. Note that this article focuses on how to use the compressed data to support various analytics tasks such as word count; we do not further discuss the compression process. ...
Article
Full-text available
This article provides a comprehensive description of text analytics directly on compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realizations. Additionally, a series of guidelines and technical solutions that effectively address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs, are presented. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.