Figure (available from The VLDB Journal): Independent partitioning join (IPJ)

Source publication
Article
Full-text available
High-bandwidth memory (HBM) offers an additional opportunity for hardware performance benefits. Its high available bandwidth compared to regular DRAM allows execution of many threads in parallel, avoiding memory stalls through many concurrent memory accesses. This is especially interesting considering database join algorithms optimized for multicore...
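For readers outside the join literature, a minimal single-threaded sketch of the partition-then-join idea behind the figure's independent partitioning join (IPJ) may help; in IPJ each thread would partition its own input slice independently, and the radix function, tuple layout, and use of std::unordered_multimap below are illustrative simplifications, not the paper's implementation:

```cpp
// Illustrative sketch of an independent partitioning join (IPJ):
// each worker radix-partitions its slice of R and S independently,
// then matching partition pairs are joined. Names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

constexpr unsigned kRadixBits = 4;                 // 16 partitions
constexpr unsigned kFanout    = 1u << kRadixBits;

using Partitions = std::vector<std::vector<Tuple>>;

Partitions partition(const std::vector<Tuple>& in) {
    Partitions out(kFanout);
    for (const Tuple& t : in)
        out[t.key & (kFanout - 1)].push_back(t);   // low bits pick partition
    return out;
}

// Join one pair of co-partitions with a per-partition hash table.
size_t join_partition(const std::vector<Tuple>& r, const std::vector<Tuple>& s) {
    std::unordered_multimap<uint32_t, uint32_t> ht;
    for (const Tuple& t : r) ht.emplace(t.key, t.payload);
    size_t matches = 0;
    for (const Tuple& t : s) matches += ht.count(t.key);
    return matches;
}

size_t ipj(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    Partitions pr = partition(R);   // in IPJ each thread partitions
    Partitions ps = partition(S);   // its own input slice independently
    size_t matches = 0;
    for (unsigned p = 0; p < kFanout; ++p)         // partition pairs could
        matches += join_partition(pr[p], ps[p]);   // be joined in parallel
    return matches;
}
```

With enough concurrent threads, the many independent memory streams generated by such partitioning are exactly the access pattern HBM's bandwidth rewards.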

Similar publications

Preprint
Full-text available
There is no unified definition of data anomalies, which refer to specific patterns of data operations that may violate the consistency of the database. Known data anomalies include Dirty Write, Dirty Read, Non-repeatable Read, Phantom, Read Skew, and Write Skew. In order to improve the efficiency of concurrency control algorithms, data anomalies a...

Citations

... The recent addition of HBM to FPGAs presents an opportunity to exploit high memory bandwidth with the low-power FPGA fabric. The potential of high-bandwidth memory [179,252] has been explored in many-core processors [149,376] and GPUs [149,539]. A recent work [495] shows the potential of HBM for FPGAs with a memory benchmarking tool. ...
Preprint
The cost of moving data between the memory units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. At the same time, we are witnessing an enormous amount of data being generated across multiple application domains. These trends suggest a need for a paradigm shift towards a data-centric approach where computation is performed close to where the data resides. Further, a data-centric approach can enable a data-driven view where we take advantage of vast amounts of available data to improve architectural decisions. As a step towards modern architectures, this dissertation contributes to various aspects of the data-centric approach and proposes several data-driven mechanisms. First, we design NERO, a data-centric accelerator for a real-world weather prediction application. Second, we explore the applicability of different number formats, including fixed-point, floating-point, and posit, for different stencil kernels. Third, we propose NAPEL, an ML-based application performance and energy prediction framework for data-centric architectures. Fourth, we present LEAPER, the first use of few-shot learning to transfer FPGA-based computing models across different hardware platforms and applications. Fifth, we propose Sibyl, the first reinforcement learning-based mechanism for data placement in hybrid storage systems. Overall, this thesis provides two key conclusions: (1) hardware acceleration on an FPGA+HBM fabric is a promising solution to overcome the data movement bottleneck of our current computing systems; (2) data should drive system and design decisions by leveraging inherent data characteristics to make our computing systems more efficient.
... • sensitivity to the distribution of the number of child documents: each parent document is associated with 0 to 32 child documents; the number of child documents follows three distributions: fixed, uniform, and normal, as observed in prior studies [31,32]. When the number of child documents is fixed, every parent document connects to 16 child documents. ...
Article
Full-text available
The advent of byte-addressable persistent memory opens an important opportunity for document databases to read and write durable data without first fetching it into DRAM. Reaping the benefit of persistent memory is not straightforward, as existing document databases are tailored for disk storage. They assume that disk-to-DRAM data movement dominates the performance. However, this paper points out that data indexing becomes the performance bottleneck when porting document databases to persistent memory. The paper proposes PMLiteDB, the first persistent memory document database with streamlined access paths. PMLiteDB introduces two techniques, direct reading and selective caching. Direct reading streamlines the translation from document IDs to the addresses of documents whenever possible by swizzling the IDs into persistent memory references. It guarantees that only up-to-date persistent memory references are used when document movements invalidate associated references. Selective caching reduces data movements between DRAM and persistent memory by caching only frequently accessed persistent memory data pages in a DRAM buffer. For other pages, the database loads data directly without caching. Compared to a design that adopts persistent memory as a fast disk without exploiting byte-addressability, PMLiteDB achieves a 2.33× speedup on average and up to 6.18×.
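A minimal sketch of the pointer-swizzling idea behind direct reading, assuming an index entry that keeps the stable document ID alongside a lazily filled persistent-memory reference; all names (SwizzlableRef, resolve_id, still_valid) and the toy catalog are hypothetical stand-ins for PMLiteDB's actual structures:

```cpp
// Sketch of ID-to-pointer swizzling for "direct reading": an index entry
// caches a direct persistent-memory reference next to the stable document
// ID, filling it in on first use. Toy stand-ins, not PMLiteDB's code.
#include <cstdint>
#include <vector>

struct Document { uint64_t id; /* payload ... */ };

// Toy catalog: the real database resolves IDs through its persistent
// structures and detects when a document has been moved.
static std::vector<Document> pm_pool(1024);
Document* resolve_id(uint64_t doc_id) { return &pm_pool[doc_id]; }
bool still_valid(const Document*)     { return true; }

struct SwizzlableRef {
    uint64_t  doc_id;         // stable document ID
    Document* ptr = nullptr;  // swizzled PM reference, filled lazily
};

Document* deref(SwizzlableRef& ref) {
    if (ref.ptr && still_valid(ref.ptr))   // fast path: direct PM access,
        return ref.ptr;                    // no ID translation, no DRAM copy
    ref.ptr = resolve_id(ref.doc_id);      // slow path: translate the ID,
    return ref.ptr;                        // then swizzle for future reads
}
```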
... Fully exploiting current and emerging hardware trends is a notorious challenge [30,36,41,44,46,58]. In particular, there are three key challenges (C1∼C3) to answer the question of how to efficiently parallelize IaWJ on multicore architectures. ...
... To optimize relational joins, past works have explored the usage of SIMD instructions [26], cache- and TLB-aware data replication [5,6], memory-efficient hash tables [7] and NUMA-aware data placement strategies [3]. Extensive recent efforts have also been devoted to accelerating join processing by better utilizing specialized hardware architectures such as GPUs [30,44], FPGAs [10,17] and high-bandwidth memory [36]. The common goal is to find a scheme that minimizes the overall join processing time by wisely managing the balance between computation and communication/partitioning cost among different architectures. ...
... The common goal is to find a scheme that minimizes the overall join processing time by wisely managing the balance between computation and communication/partitioning cost among different architectures. All those join algorithms are covered by the algorithm design aspects we consider, including eager/lazy, sort/hash, and various partitioning schemes, but with novel implementations such as specialized data partitioning schemes for novel hardware architectures [30,36,44] and adaptive partitioning models [25]. All those works are highly valuable, but none of them has comprehensively evaluated different approaches to parallelizing IaWJ on modern multicores. ...
Conference Paper
Full-text available
The intra-window join (IaWJ), i.e., joining two input streams over a single window, is a core operation in modern stream processing applications. This paper presents the first comprehensive study on parallelizing the IaWJ on modern multicore architectures. In particular, we classify IaWJ algorithms into lazy and eager execution approaches. For each approach, there are further design aspects to consider, including different join methods and partitioning schemes, leading to a large design space. Our results show that none of the algorithms always performs the best, and the choice of the most performant algorithm depends on: (i) workload characteristics, (ii) application requirements, and (iii) hardware architectures. Based on the evaluation results, we propose a decision tree that can guide the selection of an appropriate algorithm.
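A compact sketch of the eager/lazy distinction the paper classifies, assuming a keyed equi-join over a single window; hash tables here stand in for whichever join method and partitioning scheme a concrete implementation picks:

```cpp
// Sketch contrasting eager and lazy intra-window join (IaWJ) execution.
// Eager: each arriving tuple immediately probes the opposite side's state,
// then is inserted into its own side. Lazy: tuples are only buffered, and
// a classic build/probe join runs once when the window closes.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Event { uint32_t key; uint32_t value; };

struct EagerIaWJ {
    std::unordered_multimap<uint32_t, uint32_t> left, right;
    size_t matches = 0;

    void on_left(const Event& e) {            // probe right, insert left
        matches += right.count(e.key);
        left.emplace(e.key, e.value);
    }
    void on_right(const Event& e) {           // probe left, insert right
        matches += left.count(e.key);
        right.emplace(e.key, e.value);
    }
};

struct LazyIaWJ {
    std::vector<Event> left, right;           // just buffer during the window

    size_t on_window_close() {                // build/probe at window close
        std::unordered_multimap<uint32_t, uint32_t> ht;
        for (const Event& e : left) ht.emplace(e.key, e.value);
        size_t matches = 0;
        for (const Event& e : right) matches += ht.count(e.key);
        return matches;
    }
};
```

Eager execution spreads join work across the window and yields early results; lazy execution concentrates it at window close, which is where partitioning schemes and join-method choices matter most.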
... Cheng et al. [15] demonstrated a highly parallel in-memory join on the 64-core Intel Knights Landing platform. Pohl et al. [48] have shown how HBM (high-bandwidth memory) can be used as another layer in the storage hierarchy to improve performance. Hash- and sort-based aggregation is evaluated in [46]. ...
Article
Full-text available
The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
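The FPGA design hides memory latency by keeping hundreds of linked-list traversals in flight; a rough software analogue of that idea on CPUs is group prefetching, sketched below using GCC/Clang's __builtin_prefetch. The batch size, node layout, and hash function are illustrative assumptions, not the paper's accelerator logic:

```cpp
// Software analogue of keeping many probes in flight: advance a group of
// hash-chain traversals in lockstep, prefetching each next node so the
// pointer chase overlaps memory latency. Structures are illustrative.
#include <cstddef>
#include <cstdint>

struct Node { uint32_t key; uint32_t payload; Node* next; };

size_t probe_batch(Node* const* buckets, size_t nbuckets,
                   const uint32_t* keys, size_t n) {
    constexpr size_t G = 16;              // traversals kept in flight
    Node*  cur[G];
    size_t matches = 0;
    for (size_t base = 0; base < n; base += G) {
        size_t g = (n - base < G) ? n - base : G;
        for (size_t i = 0; i < g; ++i) {  // stage 1: hash + prefetch bucket
            cur[i] = buckets[keys[base + i] % nbuckets];
            if (cur[i]) __builtin_prefetch(cur[i]);
        }
        for (bool active = true; active; ) {   // stage 2: lockstep chase
            active = false;
            for (size_t i = 0; i < g; ++i) {
                if (!cur[i]) continue;
                if (cur[i]->key == keys[base + i]) ++matches;
                cur[i] = cur[i]->next;         // prefetch the next hop so it
                if (cur[i]) {                  // arrives while other slots
                    __builtin_prefetch(cur[i]);// are being advanced
                    active = true;
                }
            }
        }
    }
    return matches;
}
```

A CPU can sustain only tens of such in-flight chains; the paper's point is that the FPGA fabric sustains hundreds, each with its own locally managed thread state.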
... Modern microprocessors, following the principle of locality and balancing memory cost against performance, use a hierarchical storage structure to reduce memory access latency. However, the emergence and development of multi-core and multi-threaded parallel processor architectures have placed higher demands on memory access bandwidth [1]. The resulting "memory wall" problem has become a bottleneck in improving the performance of computer systems [2]. ...
Article
Full-text available
In order to reduce the speed mismatch between high-performance processors and DRAM, current processors add a cache hierarchy, but a low cache hit rate still seriously affects the actual performance of programs. Data prefetching can alleviate the memory access latency and low hit rates caused by this speed gap. Building on the open-source LLVM compiler, this article first implements a data prefetch module for the Shenwei platform. The paper improves the prefetch distance algorithm, proposes a new prefetch scheduling algorithm, and introduces a cost model to evaluate prefetch benefit, accurately determining where to insert prefetch instructions to improve the cache hit rate. SPEC2006 performance results show that, after optimization, a single core of the Shenwei 1621 processor achieves up to a 50% performance improvement, and 11% on average.
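A minimal illustration of the core idea, not the Shenwei/LLVM pass itself: software prefetching issues a prefetch a fixed distance ahead of the current iteration, where the distance approximates memory latency divided by per-iteration cost (the constant below is an assumed placeholder for what the cost model would compute):

```cpp
// Sketch of compiler-style software prefetching for a streaming loop:
// the prefetch distance D approximates memory latency divided by the
// cost of one loop iteration, so each line arrives just before use.
#include <cstddef>

float sum_with_prefetch(const float* a, size_t n) {
    constexpr size_t D = 64;                 // prefetch distance in elements;
    float s = 0.0f;                          // a cost model would derive this
    for (size_t i = 0; i < n; ++i) {
        if (i + D < n)
            __builtin_prefetch(&a[i + D]);   // fetch a future element early
        s += a[i];
    }
    return s;
}
```

Too small a distance and the data is still in flight at use; too large and it is evicted before use, which is why the paper's scheduling algorithm and cost model focus on the insertion point.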
Article
Storage-based joins are still commonly used today because the memory budget does not always scale with the data size. Among the many join algorithms developed, one that has been widely deployed and proven to be efficient is the Hybrid Hash Join (HHJ), which is designed to exploit any available memory to maximize the data that is joined directly in memory. However, HHJ cannot fully exploit detailed knowledge of the join attribute correlation distribution. In this paper, we show that given a correlation skew in the join attributes, HHJ partitions data in a suboptimal way. To show this, we derive the optimal partitioning using a new cost-based analysis of partitioning-based joins that is tailored for primary key-foreign key (PK-FK) joins, one of the most common join types. This optimal partitioning strategy has a high memory cost; thus, we further derive an approximate algorithm that has tunable memory cost and leads to near-optimal results. Our algorithm, termed NOCAP (Near-Optimal Correlation-Aware Partitioning) join, outperforms the state of the art for skewed correlations by up to 30%, and the textbook Grace Hash Join by up to 4×. Further, for a limited memory budget, NOCAP outperforms HHJ by up to 10%, even for uniform correlation. Overall, NOCAP dominates state-of-the-art algorithms and mimics the best algorithm for a memory budget varying from below √||relation|| to more than ||relation||.
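For context, a schematic of the textbook Hybrid Hash Join that NOCAP's analysis starts from, assuming a single memory-resident partition and in-memory vectors standing in for spill files; the modulo partitioning function and fanout choice are simplifications:

```cpp
// Schematic Hybrid Hash Join (the baseline NOCAP improves on): tuples of
// partition 0 are joined directly in memory; other partitions are spilled
// and joined pairwise in a second pass. Spill "files" are vectors here.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

size_t hybrid_hash_join(const std::vector<Tuple>& R,
                        const std::vector<Tuple>& S,
                        unsigned fanout /* derived from memory budget */) {
    std::unordered_multimap<uint32_t, uint32_t> resident;  // partition 0 of R
    std::vector<std::vector<Tuple>> rSpill(fanout), sSpill(fanout);

    for (const Tuple& t : R) {                    // build phase
        unsigned p = t.key % fanout;
        if (p == 0) resident.emplace(t.key, t.payload);
        else        rSpill[p].push_back(t);       // would go to disk
    }
    size_t matches = 0;
    for (const Tuple& t : S) {                    // probe phase
        unsigned p = t.key % fanout;
        if (p == 0) matches += resident.count(t.key);
        else        sSpill[p].push_back(t);
    }
    for (unsigned p = 1; p < fanout; ++p) {       // second pass per partition
        std::unordered_multimap<uint32_t, uint32_t> ht;
        for (const Tuple& t : rSpill[p]) ht.emplace(t.key, t.payload);
        for (const Tuple& t : sSpill[p]) matches += ht.count(t.key);
    }
    return matches;
}
```

NOCAP's observation is that which keys land in the memory-resident portion should depend on the correlation skew of the join attributes, not on an oblivious hash split as above.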
Article
Data processing systems often leverage vector instructions to achieve higher performance. When applying vector instructions, an often overlooked data structure is the hash table, even though it is fundamental in data processing systems for operations such as indexing, aggregating, and joining. In this paper, we characterize and evaluate three fundamental vectorized hashing schemes: vectorized linear probing (VLP), vectorized fingerprinting (VFP), and bucket-based comparison (BBC). We implement these hashing schemes on the x86, ARM, and Power CPU architectures, as modern database systems must provide efficient implementations for multiple platforms due to continuously increasing hardware heterogeneity. We present various implementation variants and platform-specific optimizations, which we evaluate for integer keys, string keys, large payloads, skewed distributions, and multiple threads. Our extensive evaluation and comparison to three scalar hashing schemes on four servers shows that BBC outperforms scalar linear probing by a factor of more than 2×, while also scaling well to high load factors. We find that vectorized hashing schemes come with caveats that need to be considered, such as increased engineering overhead and differences between CPUs and between vector ISAs (e.g., AVX vs. AVX-512) that impact performance. We conclude with key findings for vectorized hashing scheme implementations.
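A minimal x86/AVX2 sketch of one of the three schemes, vectorized linear probing (VLP): eight 32-bit slots are compared against the probe key (and against the empty sentinel) in a single instruction each. The identity hash, sentinel choice, and open-addressing details are assumptions for brevity, not the paper's implementations:

```cpp
// Minimal AVX2 sketch of vectorized linear probing (VLP): compare eight
// 32-bit table slots against the probe key at once, and also against the
// empty sentinel to know when probing can stop. x86-only (compile -mavx2).
#include <cstdint>
#include <immintrin.h>

constexpr uint32_t kEmpty = 0;        // sentinel: slot unoccupied

// table: `capacity` slots, capacity a power of two (hence a multiple of 8).
// Hash is identity-masked for brevity; a real table hashes the key first.
bool vlp_contains(const uint32_t* table, uint32_t capacity, uint32_t key) {
    const __m256i vkey   = _mm256_set1_epi32(static_cast<int>(key));
    const __m256i vempty = _mm256_set1_epi32(static_cast<int>(kEmpty));
    uint32_t group = (key & (capacity - 1)) & ~7u;      // 8-aligned start
    for (uint32_t probed = 0; probed < capacity; probed += 8) {
        __m256i slots = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(table + group));
        int hit = _mm256_movemask_ps(_mm256_castsi256_ps(
            _mm256_cmpeq_epi32(slots, vkey)));
        if (hit) return true;                            // key found
        int empty = _mm256_movemask_ps(_mm256_castsi256_ps(
            _mm256_cmpeq_epi32(slots, vempty)));
        if (empty) return false;                         // hole ends the probe
        group = (group + 8) & (capacity - 1);            // wrap around
    }
    return false;
}
```

The ARM (NEON/SVE) and Power (VSX) variants the paper evaluates follow the same structure with different intrinsics and vector widths, which is precisely where the cross-ISA performance differences arise.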
Article
FPGAs are increasingly being used in data centers and the cloud due to their potential to accelerate certain workloads as well as for their architectural flexibility, since they can be used as accelerators, as smart-NICs, or as stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are quickly evolving in terms of their capabilities and organization. The utilization of High Bandwidth Memory (HBM) in FPGA devices is one recent example of such a trend. In this paper, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We consider three workloads common in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. We consider two possible configurations of the HBM, using a single-clock and a dual-clock version design. With the right design, FPGA+HBM-based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 5.9× (range selection), 18.3× (hash join), and 6.1× (SGD).
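Of the three workloads, range selection is the easiest to picture in software terms: a columnar scan emitting a selection vector, the intermediate form MonetDB-style engines pass between operators. A scalar sketch of what the FPGA evaluates many rows of per cycle (names are illustrative):

```cpp
// Scalar sketch of the range-selection workload: scan a column and emit
// a selection vector of qualifying row IDs, the intermediate form a
// columnar engine such as MonetDB passes between operators.
#include <cstdint>
#include <vector>

std::vector<uint32_t> range_select(const std::vector<int32_t>& col,
                                   int32_t lo, int32_t hi) {
    std::vector<uint32_t> sel;
    sel.reserve(col.size());
    for (uint32_t i = 0; i < static_cast<uint32_t>(col.size()); ++i)
        if (col[i] >= lo && col[i] <= hi)   // branchy for clarity; the FPGA
            sel.push_back(i);               // evaluates many rows per cycle
    return sel;
}
```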
Article
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 5.3× and 12.7× when running two different compound stencil kernels. NERO reduces the energy consumption by 12× and 35× for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/Watt and 21.01 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
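Compound stencils chain several elementary stencils over the same grid; a minimal stand-in composing two 5-point Laplacian sweeps (not the paper's weather kernels) shows the neighbor-access pattern that makes such kernels memory-bound:

```cpp
// Minimal compound-stencil example: two chained 5-point Laplacian sweeps
// over a 2D grid, a stand-in for the weather kernels NERO accelerates.
// The scattered neighbor accesses and low arithmetic intensity are what
// make such kernels memory-bound. Assumes a non-empty rectangular grid.
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<float>>;

Grid laplacian(const Grid& in) {
    size_t ny = in.size(), nx = in[0].size();
    Grid out(ny, std::vector<float>(nx, 0.0f));
    for (size_t y = 1; y + 1 < ny; ++y)
        for (size_t x = 1; x + 1 < nx; ++x)
            out[y][x] = in[y][x - 1] + in[y][x + 1]    // 4 neighbor loads,
                      + in[y - 1][x] + in[y + 1][x]    // ~5 flops per point:
                      - 4.0f * in[y][x];               // low intensity
    return out;
}

Grid compound_stencil(const Grid& in) {
    return laplacian(laplacian(in));   // elementary stencils composed
}
```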