Query endpoints and bin boundaries. Horizontal lines represent query ranges. Dotted vertical lines mark query endpoints.

Source publication

Optimizing candidate check costs for bitmap indices

Conference Paper

Oct 2005

In this paper, we propose a new strategy for optimizing the placement of bin boundaries to minimize the cost of query evaluation using bitmap indices with binning. For attributes with a large number of distinct values, often the most efficient index scheme is a bitmap index with binning. However, this type of index may not be able to fully resolve...

Context 1

... Figure 4 a set of 10 range queries and a binning into 4 bins is shown. In this example, query q3 has no edge bins since both its endpoints fall on bin boundaries. ...

Context 2

... a given bin b, let E(b) denote the set of queries that have bin b as an edge bin. For example, in Figure 4 E(b1) = {q1, q2}; E(b2) = {q1, q2, q4, q5, q6, q7, q8}; E(b3) = {q9}; E(b4) = {q8, q9, q10}. Let n_b denote the number of data values that fall into the range defined by b; this is also the number of "1-bits" in the bitmap corresponding to b. ...
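The definition of E(b) can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: bins and query ranges are treated as half-open intervals [lo, hi), a bin counts as an edge bin of a query when the query overlaps it without covering it completely, and the function name `edge_bin_sets` and the sample boundaries and queries are hypothetical (they are not the data of Figure 4).

```python
from collections import defaultdict

def edge_bin_sets(boundaries, queries):
    """Return {bin index: set of query names having that bin as an edge bin}.

    Assumption: bin i is the half-open interval [boundaries[i], boundaries[i+1]),
    and each query is a half-open range (l, u).
    """
    bins = list(zip(boundaries, boundaries[1:]))
    result = defaultdict(set)
    for name, (l, u) in queries.items():
        for i, (lo, hi) in enumerate(bins):
            overlaps = l < hi and u > lo    # query touches the bin
            covered = l <= lo and u >= hi   # query spans the whole bin
            if overlaps and not covered:
                result[i].add(name)
    return dict(result)

# Hypothetical data: 4 bins and 3 range queries.
boundaries = [0, 10, 20, 30, 40]
queries = {"qa": (5, 25), "qb": (10, 30), "qc": (32, 38)}
E = edge_bin_sets(boundaries, queries)
print(E)
```

In this sketch `qb` contributes to no set because both of its endpoints coincide with bin boundaries, which mirrors the remark about q3 in Context 1.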

Context 3

... EP(Q) = e1, ..., er denote the ordered set of distinct query endpoints of queries in Q, i.e., ei ∈ EP(Q) implies that for at least one query q ∈ Q, ei = lq or ei = uq. For example, Figure 4 shows the distinct endpoints of 10 range queries. ...
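As a small illustration of EP(Q), the following Python sketch collects the lower and upper endpoints of every query, drops duplicates, and sorts them. The function name `distinct_endpoints` and the sample queries are hypothetical, and representing each query as an `(l, u)` tuple is an assumption.

```python
def distinct_endpoints(queries):
    """Return the ordered list of distinct endpoints over all (l, u) queries."""
    return sorted({e for l, u in queries for e in (l, u)})

# Hypothetical queries: the endpoints 5 and 30 are each shared by two queries.
sample_queries = [(5, 25), (10, 30), (5, 30)]
endpoints = distinct_endpoints(sample_queries)
print(endpoints)
```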

Optimizing Scoring and Sorting Operations for Faster WAND Processing

Chapter

Jan 2020

In recent years, a lot of research has focused on how to improve the query processing efficiency of large-scale search engines. In this paper, we focus on top-k query processing on document-sorted indexes and the well-known rank-safe dynamic pruning technique called WAND, which can efficiently reduce the hardware computational resources required for the f...

Adaptive Distributed Top-k Query Processing

Article

Jun 2016

ADiT is an adaptive approach for processing distributed top-$k$ queries over peer-to-peer networks that optimizes both system load and query response time. This approach considers the size of the peer-to-peer network, the number $k$ of searched objects, the network capabilities of a connected peer, i.e. the transmission rate, the amount of objects stor...

Illustration of skyline query and its variants: (a) Skyline query, (b)...

Wall clock time vs. number of groups: (a) CarDB, (b) NBA

Number of node accesses vs. number of groups (dimensionality = 3,...

Wall clock time vs. number of groups (dimensionality = 3,...

Number of node accesses vs. dimensionality (number of groups = 4,...

Efficient group-by reverse skyline computation

Article

Nov 2016

The reverse skyline query is very useful in many decision making applications. Given a multi-dimensional dataset P and a query point q, the reverse skyline query returns all the points in P whose dynamic skyline contains q. Although the reverse skyline retrieval has been well-studied in the literature, there is, to the best of our knowledge, no pri...
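The definition in the abstract above can be sketched as a brute-force Python check. This is not the paper's algorithm: it assumes the usual dynamic-dominance test on coordinate-wise absolute distances to the point under consideration, and the names `dominates_wrt` and `reverse_skyline` as well as the sample points are hypothetical.

```python
def dominates_wrt(center, a, b):
    """True if a dynamically dominates b with respect to center, i.e. a is
    at least as close to center in every dimension and strictly closer in one."""
    da = [abs(x - c) for x, c in zip(a, center)]
    db = [abs(x - c) for x, c in zip(b, center)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))

def reverse_skyline(points, q):
    """Return every p in points whose dynamic skyline contains q.

    Brute force: q is in the dynamic skyline of p when no other point
    of the dataset dynamically dominates q with respect to p.
    """
    result = []
    for i, p in enumerate(points):
        others = (o for j, o in enumerate(points) if j != i)
        if not any(dominates_wrt(p, o, q) for o in others):
            result.append(p)
    return result

# Hypothetical 2-D dataset and query point.
P = [(1, 1), (2, 2), (8, 8)]
rsl = reverse_skyline(P, (3, 3))
print(rsl)
```

The quadratic scan is only meant to make the definition concrete; it makes no attempt at the efficiency the paper targets.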

ieee-icenco200403012016

Presentation

Dec 2004

For storage and searching of data, we need indexing or machine learning. Bayesian networks various algorithms are useful in learning system and also for KBS. Both storage and query processing through Bayesian is considered here with various file structure. Statistical reasoning by Wrapper, Mni, GBNi, MNFsi algorithm which uses Bayesian learni...

AttentionAttractor: efficient video stream similarity query processing in real time

Conference Paper

May 2007

In a project, customers are attracted by a video streaming application. A video camera records people passing by, and a monitor shows an alienated version of the setting accordingly. The idea is to replace the image on the video screen by a mosaic of similar images to draw their attention to the location. For successful implementation, several aspe...

BinDex: A Two-Layered Index for Fast and Robust Scans

Conference Paper

Jun 2020

Mining Big Data with Random Forests

Article

Apr 2019

In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective.

Parallel membership queries on very large scientific data sets using bitmap indexes

Article

Jan 2019
CONCURR COMP-PRACT E

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating‐point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word‐Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
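How a bitmap index answers a membership query can be sketched in a few lines of Python. This is an uncompressed, single-process illustration, not the parallel, Word-Aligned Hybrid compressed setup the abstract describes. Python integers stand in for bitmaps, and the names `build_bitmap_index` and `membership_query` and the sample column are hypothetical.

```python
def build_bitmap_index(values):
    """Build one bitmap per distinct value; bit r is set when row r holds that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def membership_query(index, wanted):
    """Return the row ids whose value is a member of `wanted`."""
    hits = 0
    for v in wanted:
        hits |= index.get(v, 0)   # OR together the bitmaps of the queried values
    return [row for row in range(hits.bit_length()) if (hits >> row) & 1]

# Hypothetical categorical column (particle type per row).
column = ["e", "mu", "pi", "e", "pi"]
index = build_bitmap_index(column)
rows = membership_query(index, {"e", "pi"})
print(rows)
```

The query touches only the bitmaps of the requested values and combines them with bitwise OR, so the data column itself is never scanned.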

SciSD: Novel Subgroup Discovery Over Scientific Datasets Using Bitmap Indices

Technical Report

Mar 2015

Subgroup discovery is a broadly applicable exploratory technique, which identifies interesting subgroups with respect to a property of interest. While there is clearly a need to apply this method to discover interesting patterns from scientific datasets comprising large-scale arrays, the existing algorithms primarily apply to relational datasets. In this paper, we present a novel algorithm, SciSD, for exhaustive but efficient subgroup discovery over array-based scientific datasets, in which all attributes are numeric. Our algorithm handles a key challenge associated with array data, which is that a subgroup identified over array data can be described based on value-based and/or dimension-based attributes. To reduce the computational costs, our SciSD algorithm extensively uses bitmap indices (and fast bitwise operations on them). We demonstrate both high efficiency and effectiveness of our algorithm by using multiple real-life datasets.

Building Hierarchical Bitmap Indices in Space Constrained Environments

Article

Feb 2015

Jong Wook Kim

Since bitmap indices are useful for OLAP queries over low-cardinality data columns, they are frequently used in data warehouses. In many data warehouse applications, the domain of a column tends to be hierarchical, such as categorical data and geographical data. When the domain of a column is hierarchical, a hierarchical bitmap index is able to significantly improve the performance of queries with conditions on that column. This strategy, however, has a limitation: when a large-scale hierarchy is used, building a bitmap for each distinct node leads to a large space overhead. Thus, in this paper, we introduce a way to build a hierarchical bitmap index in space-constrained environments on an attribute whose domain is organized into a large-scale hierarchy. In particular, to determine the space overhead of hierarchical bitmap indices, we propose a cut-selection strategy that divides the entire hierarchy into two exclusive regions.

On Index Structures for Star Query Processing in Data Warehouses

Article

Mar 2014

One of the important research and technological issues in data warehouse performance is the optimization of analytical queries. Most of the research has focused on optimizing such queries by means of materialized views, data and index partitioning, as well as various index structures including join indexes, bitmap join indexes, multidimensional indexes, and index-based multidimensional clusters. These structures neither support navigation along dimension hierarchies well nor optimize joins with the Time dimension, which in practice is used in the majority of analytical queries. In this chapter we overview the basic index structures, namely a bitmap index, a join index, and a bitmap join index. Based on these indexes, we show how to build another index, called Time-HOBI, for optimizing queries that address the Time dimension and compute aggregates along dimension hierarchies. We further discuss the extension of the index with an additional data structure for storing aggregate values along the hierarchical structure of the index. The aggregates are used for speeding up aggregate queries along dimension hierarchies. Furthermore, we show how the index is used for answering queries in an example data warehouse. Finally, we discuss its performance-related characteristics, based on experiments.

Dynamic bitmap index recompression through workload-based optimizations

Conference Paper

Oct 2013

Many large-scale read-only databases and data warehouses use bitmap indices in an effort to speed up data analysis. These indices have the dual properties of compressibility and being able to leverage fast bit-wise operations for query processing. Numerous hybrid run-length encoding compression schemes have been proposed that greatly compress the index and enable querying without the need to decompress. Typically, these schemes align their compression with the computer architecture's word size to further accelerate queries. Previously, we introduced Variable Length Compression (VLC), which uses a general encoding that can achieve better compression than word-aligned schemes. However, VLC's querying efficiency can vary widely due to mismatched alignment of compressed columns. In this paper, we present an optimizer which recompresses the bitmap over time. Based on query history, our approach allows the VLC user to specify the priority of compression versus query efficiency, then possibly recompress the bitmap accordingly. In an empirical study using scientific data sets, we showed that our approach was able to achieve both better compression ratios and query speedup over WAH and PLWAH. On the largest data set, our VLC optimizer compressed up to 1.73x better than WAH, and 1.46x over PLWAH. We also show a slight improvement in query efficiency in most experiments, while observing lucrative (11x to 16x) speedup in special cases.

Time–HOBI: Index for optimizing star queries

Article

Jul 2012
INFORM SYST

One of the important research and technological problems in data warehouse query optimization concerns star queries. So far, most of the research has focused on optimizing such queries by means of join indexes, bitmap join indexes, or various multidimensional indexes. These structures neither support navigation well along dimension hierarchies nor optimize joins with the Time dimension, which in practice is used in most of the star queries. In this paper we propose an index, called Time–HOBI, for optimizing the star queries that compute aggregates along dimension hierarchies. Time–HOBI, created on a dimension hierarchy, is composed of (1) a Hierarchically Organized Bitmap Index (HOBI), where one bitmap index is maintained for one dimension level, and (2) a Time Index (TI) that implicitly encodes time in every dimension. HOBI allows fact rows satisfying predicates defined on different levels of dimension hierarchies to be found quickly. With the support of TI, joining a fact table with the Time dimension is avoided. Thus, Time–HOBI supports a broad class of star queries. In this paper we explain how query execution plans for star queries can profit from Time–HOBI. We show, based on experiments, the efficiency of Time–HOBI for different classes of queries, as compared to HOBI and a traditional bitmap index. Based on the experiments, we also demonstrate how sensitive Time–HOBI is to variable selectivity of queries. We also analyze the maintenance time of Time–HOBI as compared to HOBI and a traditional bitmap index. The experiments used in the paper have been conducted on a real dataset, coming from the biggest East-European Internet auction platform Allegro.pl. The experiments show that Time–HOBI can be successfully applied to the optimization of star queries as it offers promising performance improvement.

Data Warehouse Performance: Selected Techniques and Data Structures

Chapter

Jan 2012

Robert Wrembel

Data stored in a data warehouse (DW) are retrieved and analyzed by complex analytical applications, often expressed by means of star queries. Such queries often scan huge volumes of data and are computationally complex. For this reason, an acceptable (or good) DW performance is one of the important features that must be guaranteed for DW users. Good DW performance can be achieved in multiple components of a DW architecture, starting from hardware (e.g., parallel processing on multiple nodes, fast disks, huge main memory, fast multi-core processor), through physical storage schemes (e.g., row storage, column storage, multidimensional store, data and index compression algorithms), state of the art techniques of query optimization (e.g., cost models and size estimation techniques, parallel query optimization and execution, join algorithms), and additional data structures improving data searching efficiency (e.g., indexes, materialized views, clusters, partitions). In this chapter we aim at presenting only a narrow aspect of the aforementioned technologies. We discuss three types of data structures, namely indexes (bitmap, join, and bitmap join), materialized views, and partitioned tables. We show how they are being applied in the process of executing star queries in three commercial database/data warehouse management systems, i.e., Oracle, DB2, and SQL Server.

Finding regions of interest on toroidal meshes

Article

Mar 2011

Fusion promises to provide clean and safe energy, and a considerable amount of research effort is under way to turn this aspiration into a reality. This work focuses on a building block for analyzing data produced from the simulation of microturbulence in magnetic confinement fusion devices: the task of efficiently extracting regions of interest. Like many other simulations where a large number of data are produced, the careful study of 'interesting' parts of the data is critical to gain understanding. In this paper, we present an efficient approach for finding these regions of interest. Our approach takes full advantage of the underlying mesh structure in magnetic coordinates to produce a compact representation of the mesh points inside the regions and an efficient connected component labeling algorithm for constructing regions from points. This approach scales linearly with the surface area of the regions of interest instead of the volume as shown with both computational complexity analysis and experimental measurements. Furthermore, this new approach is hundreds of times faster than a recently published method based on Cartesian coordinates.
