Figure 4 - uploaded by Kurt Stockinger
Query endpoints and bin boundaries. Horizontal lines represent query ranges. Dotted vertical lines mark query endpoints.

Source publication
Conference Paper
In this paper, we propose a new strategy for optimizing the placement of bin boundaries to minimize the cost of query evaluation using bitmap indices with binning. For attributes with a large number of distinct values, often the most efficient index scheme is a bitmap index with binning. However, this type of index may not be able to fully resolve...

Contexts in source publication

Context 1
... Figure 4 a set of 10 range queries and a binning into 4 bins is shown. In this example query q3 has no edge bins since both its endpoints fall on bin boundaries. ...
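The notion of an edge bin can be illustrated with a minimal sketch. It assumes half-open bins `[b_i, b_{i+1})` given by a sorted list of boundaries and half-open query ranges `[lo, hi)`; the function name `edge_bins` and these conventions are illustrative assumptions, not taken from the source publication.

```python
def edge_bins(boundaries, lo, hi):
    """Return indices of bins that the query range [lo, hi) covers only partially.

    Assumption: bin i spans [boundaries[i], boundaries[i + 1]).
    """
    edges = []
    for i in range(len(boundaries) - 1):
        b_lo, b_hi = boundaries[i], boundaries[i + 1]
        overlaps = lo < b_hi and b_lo < hi      # query touches the bin
        contained = lo <= b_lo and b_hi <= hi   # bin lies fully inside the query
        if overlaps and not contained:
            edges.append(i)
    return edges
```

Under these assumptions, a query whose two endpoints both coincide with bin boundaries yields an empty list, which matches the remark about q3 above.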
Context 2
... a given bin b, let E(b) denote the set of queries that have bin b as an edge bin. For example, in Figure 4 E(b1) = {q1, q2}; E(b2) = {q1, q2, q4, q5, q6, q7, q8}; E(b3) = {q9}; E(b4) = {q8, q9, q10}. Let n_b denote the number of data values that fall into the range defined by b; this is also the number of "1-bits" in the bitmap corresponding to b. ...
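As a rough illustration of the two quantities defined here, the sketch below computes, for each bin, the set of queries that have it as an edge bin and the number of data values that fall into it. The half-open bin convention, the function names, and the toy data in the usage note are assumptions made for this example and do not reproduce Figure 4.

```python
from bisect import bisect_right


def edge_query_sets(boundaries, queries):
    """Map each bin index to the set of query names that have it as an edge bin.

    Assumptions: bin i spans [boundaries[i], boundaries[i + 1]) and
    `queries` maps a query name to a half-open range (lo, hi).
    """
    n_bins = len(boundaries) - 1
    E = {i: set() for i in range(n_bins)}
    for name, (lo, hi) in queries.items():
        for i in range(n_bins):
            b_lo, b_hi = boundaries[i], boundaries[i + 1]
            overlaps = lo < b_hi and b_lo < hi
            contained = lo <= b_lo and b_hi <= hi
            if overlaps and not contained:
                E[i].add(name)
    return E


def bin_counts(boundaries, values):
    """Count the data values in each bin (the number of 1-bits in its bitmap)."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        i = bisect_right(boundaries, v) - 1
        if 0 <= i < len(counts):  # ignore values outside the binned domain
            counts[i] += 1
    return counts
```

For example, with boundaries `[0, 10, 20, 30, 40]` and hypothetical queries `{"q1": (5, 15), "q2": (10, 30), "q3": (25, 35)}`, bins 0 and 1 are edge bins of `q1`, bins 2 and 3 are edge bins of `q3`, and `q2` has none.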
Context 3
... EP(Q) = {e_1, ..., e_r} denote the ordered set of distinct query endpoints of queries in Q, i.e., e_i ∈ EP(Q) implies that for at least one query q ∈ Q, e_i = l_q or e_i = u_q. For example, Figure 4 shows the distinct endpoints of 10 range queries. ...
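The ordered set of distinct endpoints is straightforward to compute. The sketch below assumes each query is given as a `(lower, upper)` pair; the function name `distinct_endpoints` is a hypothetical helper introduced only for illustration.

```python
def distinct_endpoints(queries):
    """Return the sorted list of distinct lower and upper endpoints of all queries."""
    return sorted({e for lo, hi in queries for e in (lo, hi)})
```

For the hypothetical queries `[(5, 15), (10, 30), (15, 30)]`, the endpoints 15 and 30 each occur twice but appear once in the result.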

Similar publications

Chapter
In recent years, a lot of research has focused on how to improve the query processing efficiency of large-scale search engines. In this paper, we focus on top-k query processing on document-sorted indexes and the well-known rank-safe dynamic pruning technique called WAND, which can efficiently reduce the hardware computational resources required for the f...
Article
ADiT is an adaptive approach for processing distributed top-$k$ queries over peer-to-peer networks optimizing both system load and query response time. This approach considers the size of the peer to peer network, the amount $k$ of searched objects, the network capabilities of a connected peer, i.e. the transmission rate, the amount of objects stor...
Article
The reverse skyline query is very useful in many decision making applications. Given a multi-dimensional dataset P and a query point q, the reverse skyline query returns all the points in P whose dynamic skyline contains q. Although the reverse skyline retrieval has been well-studied in the literature, there is, to the best of our knowledge, no pri...
Presentation
For storage and searching of data, we need indexing or machine learning. Various Bayesian network algorithms are useful in learning systems and also for KBS. Both storage and query processing through Bayesian networks are considered here with various file structures. Statistical reasoning by Wrapper, Mni, GBNi, MNFsi algorithm which uses Bayesian learni...
Conference Paper
In a project, customers are attracted by a video streaming application. A video camera records people passing by, and a monitor shows an alienated version of the setting accordingly. The idea is to replace the image on the video screen by a mosaic of similar images to draw their attention to the location. For successful implementation, several aspe...

Citations

... Range encoding can overcome the inefficiencies of traditional bitmaps to accelerate range scan operations [7,8,21]. Based on range encoding, binning further optimizes the scan performance on columns with high cardinality, where multiple values are covered in a bin/bitmap to reduce the number of bitmaps [26,27,30]. Previous bitmap approaches with binning are similar to the filter layer of BinDex. ...
... The random strategy is an alternative to the equi-width strategy. We chose a random strategy because, in the equi-width strategy, outliers may cause most of the data to concentrate in a few bins [37,52], thus reducing the accuracy of the algorithm, whereas introducing randomness can alleviate this downside. Moreover, since using the random strategy is not a computationally intensive task, it is suitable for high-dimensional datasets. ...
Article
In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective.
... This procedure is called a candidate check, which may dominate the total query time [21]. Therefore, applying binning requires a strategy with a balance between the size of the index and the overhead. ...
... Rotem et al. discussed the challenges of finding the optimal binning for range queries [21]. Koudas explored the optimal binning for a given set of equality queries and introduced a space-efficient indexing for the equality query by jointly encoding the attribute values using the frequency of access and occurrence of the attribute values [53]. Stockinger et al. investigated the optimization of multidimensional range queries for relatively small sets of known queries by using additional operations on bitmaps to reduce the number of checks for the candidates. ...
Article
Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating‐point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word‐Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
... Although bitmap indexing was initially proposed in the context of data warehouses [46,77,78], recently it has been widely applied in the area of scientific data management [17,53,54,61,73]. To improve query efficiency, different binning strategies [24,48,55,76] and encoding methods [33] have been proposed. Furthermore, bitmap indexing has also shown its capability of assisting various data analysis tasks [8,52,59,60,62]. ...
Technical Report
Subgroup discovery is a broadly applicable exploratory technique, which identifies interesting subgroups with respect to a property of interest. While there is clearly a need to apply this method to discover interesting patterns from scientific datasets comprising large-scale arrays, the existing algorithms primarily apply to relational datasets. In this paper, we present a novel algorithm, SciSD, for exhaustive but efficient subgroup discovery over array-based scientific datasets, in which all attributes are numeric. Our algorithm handles a key challenge associated with array data, which is that a subgroup identified over array data can be described based on value-based and/or dimension-based attributes. To reduce the computational costs, our SciSD algorithm extensively uses bitmap indices (and fast bitwise operations on them). We demonstrate both high efficiency and effectiveness of our algorithm by using multiple real-life datasets.
... Bitmap indices perform well for query processing on columns with low cardinality, but they suffer from large storage overhead on columns with high cardinality. Binning techniques have been studied extensively as a way to solve this storage overhead problem [3,4,5,6,7,8,12,13]. Binning is a technique that partitions the data values a column can take into several bins and then creates a bitmap for each bin. ...
Article
Since bitmap indices are useful for OLAP queries over low-cardinality data columns, they are frequently used in data warehouses. In many data warehouse applications, the domain of a column tends to be hierarchical, such as categorical data and geographical data. When the domain of a column is hierarchical, a hierarchical bitmap index can significantly improve the performance of queries with conditions on that column. This strategy, however, has a limitation: when a large-scale hierarchy is used, building a bitmap for each distinct node leads to a large space overhead. Thus, in this paper, we introduce a way to build a hierarchical bitmap index on an attribute whose domain is organized into a large-scale hierarchy in space-constrained environments. In particular, in order to figure out the space overhead of hierarchical bitmap indices, we propose the cut-selection strategy, which divides the entire hierarchy into two exclusive regions.
... As a consequence, bitmap indexes defined on attributes of high cardinalities become very large or too large to be efficiently processed in main memory [56]. In order to improve the efficiency of accessing data with the support of bitmap indexes defined on attributes of high cardinalities, either different kinds of bitmap encodings have been proposed, e.g., [6,26,41,47,55] or compression techniques have been developed, e.g., [2,45,46,54]. ...
Article
One of the important research and technological issues in data warehouse performance is the optimization of analytical queries. Most of the research has focused on optimizing such queries by means of materialized views, data and index partitioning, as well as various index structures including join indexes, bitmap join indexes, multidimensional indexes, or index-based multidimensional clusters. These structures neither support navigation along dimension hierarchies well nor optimize joins with the Time dimension, which in practice is used in the majority of analytical queries. In this chapter we give an overview of the basic index structures, namely a bitmap index, a join index, and a bitmap join index. Based on these indexes, we show how to build another index, called Time-HOBI, for optimizing queries that address the Time dimension and compute aggregates along dimension hierarchies. We further discuss the extension of the index with an additional data structure for storing aggregate values along the hierarchical structure of the index. The aggregates are used for speeding up aggregate queries along dimension hierarchies. Furthermore, we show how the index is used for answering queries in an example data warehouse. Finally, we discuss its performance-related characteristics, based on experiments.
... The idea of a feedback control loop that monitors workloads and system metrics for database tuning has been well explored, e.g., [24,25,26,27,28]. The works of Koudas [29] and Rotem et al. [30] are probably the most closely related work in this area. They both consider query workloads to try to optimize the boundary selection of bins to minimize the cost of querying bitmap indices. ...
Conference Paper
Many large-scale read-only databases and data warehouses use bitmap indices in an effort to speed up data analysis. These indices have the dual properties of compressibility and being able to leverage fast bit-wise operations for query processing. Numerous hybrid run-length encoding compression schemes have been proposed that greatly compress the index and enable querying without the need to decompress. Typically, these schemes align their compression with the computer architecture's word size to further accelerate queries. Previously, we introduced Variable Length Compression (VLC), which uses a general encoding that can achieve better compression than word-aligned schemes. However, VLC's querying efficiency can vary widely due to mismatched alignment of compressed columns. In this paper, we present an optimizer which recompresses the bitmap over time. Based on query history, our approach allows the VLC user to specify the priority of compression versus query efficiency, then possibly recompress the bitmap accordingly. In an empirical study using scientific data sets, we showed that our approach was able to achieve both better compression ratios and query speedup over WAH and PLWAH. On the largest data set, our VLC optimizer compressed up to 1.73x better than WAH, and 1.46x over PLWAH. We also show a slight improvement in query efficiency in most experiments, while observing lucrative (11x to 16x) speedup in special cases.
... As a consequence, bitmap indexes defined on attributes of high cardinalities become very large or too large to be efficiently processed in main memory [47]. In order to improve the efficiency of accessing data with the support of bitmap indexes defined on attributes of high cardinalities, either different kinds of bitmap encodings have been proposed, e.g., [5,24,35,40,46] or compression techniques have been developed, e.g., [1,38,39,45]. ...
Article
One of the important research and technological problems in data warehouse query optimization concerns star queries. So far, most of the research focused on optimizing such queries by means of join indexes, bitmap join indexes, or various multidimensional indexes. These structures neither support navigation well along dimension hierarchies nor optimize joins with the Time dimension, which in practice is used in most of the star queries. In this paper we propose an index, called Time–HOBI, for optimizing the star queries that compute aggregates along dimension hierarchies. Time–HOBI, created on a dimension hierarchy, is composed of (1) a Hierarchically Organized Bitmap Index (HOBI), where one bitmap index is maintained for one dimension level, and (2) a Time Index (TI) that implicitly encodes time in every dimension. HOBI makes it possible to quickly search for fact rows satisfying predicates defined on different levels of dimension hierarchies. With the support of TI, joining a fact table with the Time dimension is avoided. Thus, Time–HOBI supports a broad class of star queries. In this paper we explain how query execution plans for star queries can profit from Time–HOBI. We show, based on experiments, the efficiency of Time–HOBI for different classes of queries, as compared to HOBI and a traditional bitmap index. Based on the experiments, we also demonstrate how sensitive Time–HOBI is to variable selectivity of queries. We also analyze the maintenance time of Time–HOBI as compared to HOBI and a traditional bitmap index. The experiments used in the paper have been conducted on a real dataset, coming from the biggest East-European Internet auction platform Allegro.pl. The experiments show that Time–HOBI can be successfully applied to the optimization of star queries as it offers promising performance improvement.
... Extensions to the Structure of the Basic Bitmap Index. In [82] (called range-based bitmap indexing) and in [65,66,71] (called binning), values of an indexed attribute are partitioned into ranges. A bitmap is constructed for representing a given range of values, rather than a distinct value. ...
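A minimal sketch of binning as described here: one bitmap per range of values rather than per distinct value. Python integers stand in for bitmaps (bit `r` set means row `r` falls in the bin), and bins are assumed to be half-open intervals between consecutive boundaries; both choices, and the name `build_binned_bitmaps`, are assumptions for illustration rather than the design of any cited system.

```python
from bisect import bisect_right


def build_binned_bitmaps(values, boundaries):
    """Build one bitmap per bin; bit r of bitmaps[i] is set when values[r] is in bin i.

    Assumption: bin i spans [boundaries[i], boundaries[i + 1]).
    """
    bitmaps = [0] * (len(boundaries) - 1)
    for row, v in enumerate(values):
        i = bisect_right(boundaries, v) - 1
        if 0 <= i < len(bitmaps):  # skip values outside the binned domain
            bitmaps[i] |= 1 << row
    return bitmaps
```

With values `[3, 17, 8, 25]` and boundaries `[0, 10, 20, 30]`, rows 0 and 2 share the first bitmap, so three bitmaps cover four distinct values.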
Chapter
Data stored in a data warehouse (DW) are retrieved and analyzed by complex analytical applications, often expressed by means of star queries. Such queries often scan huge volumes of data and are computationally complex. For this reason, an acceptable (or good) DW performance is one of the important features that must be guaranteed for DW users. Good DW performance can be achieved in multiple components of a DW architecture, starting from hardware (e.g., parallel processing on multiple nodes, fast disks, huge main memory, fast multi-core processor), through physical storage schemes (e.g., row storage, column storage, multidimensional store, data and index compression algorithms), state of the art techniques of query optimization (e.g., cost models and size estimation techniques, parallel query optimization and execution, join algorithms), and additional data structures improving data searching efficiency (e.g., indexes, materialized views, clusters, partitions). In this chapter we aim at presenting only a narrow aspect of the aforementioned technologies. We discuss three types of data structures, namely indexes (bitmap, join, and bitmap join), materialized views, and partitioned tables. We show how they are being applied in the process of executing star queries in three commercial database/data warehouse management systems, i.e., Oracle, DB2, and SQL Server.
... A disadvantage is that the binned index only answers some of the queries precisely; for others it has to go back to the base data to get precise answers. This extra step can be expensive (Rotem et al 2005, Wu et al 2008). However, in many cases, such an index is able to answer all user queries without going back to the base data. ...
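The extra step of going back to the base data can be sketched as follows. The example assumes half-open bins and half-open query ranges, stores row ids per bin instead of compressed bitmaps for brevity, and counts how many base-data lookups a query triggers; the function name `range_query` and the data layout are illustrative assumptions.

```python
from bisect import bisect_right


def range_query(values, boundaries, lo, hi):
    """Answer lo <= v < hi with a binned index; return (matching rows, base-data checks).

    Assumption: bin i spans [boundaries[i], boundaries[i + 1]).
    """
    n_bins = len(boundaries) - 1
    rows_in_bin = [[] for _ in range(n_bins)]
    for row, v in enumerate(values):
        i = bisect_right(boundaries, v) - 1
        if 0 <= i < n_bins:
            rows_in_bin[i].append(row)

    hits, checks = [], 0
    for i in range(n_bins):
        b_lo, b_hi = boundaries[i], boundaries[i + 1]
        if lo <= b_lo and b_hi <= hi:
            # Bin fully inside the query: the index alone answers it.
            hits.extend(rows_in_bin[i])
        elif lo < b_hi and b_lo < hi:
            # Partially covered bin: every row in it must be checked in the base data.
            for row in rows_in_bin[i]:
                checks += 1
                if lo <= values[row] < hi:
                    hits.append(row)
    return sorted(hits), checks
```

When both query endpoints fall on bin boundaries, the check count is zero, which corresponds to the case where the index answers the query without touching the base data.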
Article
Fusion promises to provide clean and safe energy, and a considerable amount of research effort is under way to turn this aspiration into a reality. This work focuses on a building block for analyzing data produced from the simulation of microturbulence in magnetic confinement fusion devices: the task of efficiently extracting regions of interest. Like many other simulations where a large number of data are produced, the careful study of 'interesting' parts of the data is critical to gain understanding. In this paper, we present an efficient approach for finding these regions of interest. Our approach takes full advantage of the underlying mesh structure in magnetic coordinates to produce a compact representation of the mesh points inside the regions and an efficient connected component labeling algorithm for constructing regions from points. This approach scales linearly with the surface area of the regions of interest instead of the volume as shown with both computational complexity analysis and experimental measurements. Furthermore, this new approach is hundreds of times faster than a recently published method based on Cartesian coordinates.