Fig. 4
(a) The sequence of Dow Jones indices over 500 trading days. (b) The corresponding energy spectrum.

Source publication
Article
Data cubes have become important components in most data warehouse systems and decision support systems. In such systems, users usually pose very complex queries to the online analytical processing (OLAP) system, and systems usually have to deal with a huge amount of data because of the high dimensionality of the data sets; thus, approximating query pr...

Contexts in source publication

Context 1
... number of DCT coefficients increases exponentially as the dimensionality increases. Clearly, computing all coefficients for possible selection is computationally prohibitive. Therefore, the DCT-based approaches choose and compute only the coefficients that are deemed the most representative. In practical OLAP systems, the data distribution usually exhibits correlation among data items; that is, the frequency spectrum of the distribution shows large values in its low-frequency coefficients and small values in its high-frequency coefficients [8], [10]. In general, high-frequency coefficients are usually of less interest in OLAP, whereas low-frequency coefficients are of high interest, since they correspond to range-aggregation queries [20]. To select representative coefficients, several geometrical zonal sampling techniques, such as triangular, spherical, rectangular, and reciprocal sampling, have been proposed [21]. Lee et al. [10] extend these techniques to MDSs and show that the reciprocal zonal sampling technique is able to select the most representative DCT coefficients under the brown noise assumption. This sampling technique selects coefficients by the constraint $\{k_{\vec{u}} \mid \prod_{i=1}^{d}(u_i + 1) \le b\}$, where $k_{\vec{u}}$ is the DCT coefficient located at $(u_1, u_2, \ldots, u_d)$ and $u_i$ is the frequency index in the $i$th dimension. Since only the selected coefficients need to be computed, the execution time of DCT for data is reduced to $O(nk)$, where $k$ is the number of selected coefficients. Thus, the reciprocal zonal sampling technique improves the execution efficiency significantly. For range-sum queries, it is expensive to estimate the value of each individual data point and then compute the aggregation. Lee et al. [10] introduced the integral approach to compute the aggregation efficiently. For a $d$-dimensional range-sum query, it costs $O(kd)$ to compute the aggregation if $k$ coefficients are used. For example, in an $M \times N$ 2-D data cube, the answer to the range-sum query requesting the sum of the cells located at $a < u_1 < b$ and $c < u_2 < d$ can be estimated as $S = \int_c^d \int_a^b \hat{f}(u_1, u_2)\, du_1\, du_2$, where $\hat{f}$ is defined in Section 2.2. Though not well suited to estimating a single data point [22], the DCT-based approaches are able to provide high-quality answers for range-sum queries because the errors of some cell pairs compensate for each other. Because it does not require extra storage space for coefficient transforming, DCT works well with a small working space. However, the time complexity, estimated as $O(nB)$, is deemed too high for SCS applications to process the MDS blocks online. To resolve the issues of the space cost in DWT and the time cost in DCT, we propose the DAWA algorithm, an integrated algorithm of DCT for Data and DWT, to approximate the SCS. The DAWA algorithm is able to generate a snapshot from a PB in a small working buffer via only one data scan, with a computational complexity of $O(\sqrt{nBN} + B\log\sqrt{N})$, which we discuss further in Sections 3.2 and 3.3. The DAWA framework comprises two phases for generating snapshots. The first phase, called the DCT phase, partitions a PB into several subblocks, called DAWA cells, and then applies DCT to each DAWA cell based on the optimal set of coefficients selected by reciprocal zonal sampling.
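Before turning to the second phase, the reciprocal zonal sampling rule can be made concrete with a minimal sketch (not code from the paper; the function name, the toy 16x16 block, and the budget b are illustrative assumptions). A practical implementation would enumerate only the qualifying frequency indices rather than scanning the full transform, which is what keeps the cost at O(nk).

```python
import numpy as np
from scipy.fft import dctn

def reciprocal_zonal_sample(block, b):
    """Keep only the DCT coefficients k_u whose frequency index u = (u_1, ..., u_d)
    satisfies prod(u_i + 1) <= b (the reciprocal zonal sampling constraint)."""
    coeffs = dctn(block, norm="ortho")             # full d-dimensional DCT of the block
    kept = {}
    for u in np.ndindex(coeffs.shape):             # u_i are 0-based frequency indices
        if np.prod(np.asarray(u) + 1) <= b:
            kept[u] = coeffs[u]
    return kept                                    # sparse map: frequency index -> coefficient

# Toy 2-D partition block with brown-noise-like correlation (illustrative only).
rng = np.random.default_rng(0)
pb = np.cumsum(rng.standard_normal((16, 16)), axis=0)
selected = reciprocal_zonal_sample(pb, b=8)
print(f"{len(selected)} of {pb.size} coefficients kept")
```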
The second phase, called the DWT phase, groups the DCT coefficients from the DAWA cells whose time-to-frequency transformations are close to each other into several iso-frequency MDSs, denoted as IMS. As will be shown in Section 3.1, the variance of data points within the same IMS is small, and each IMS can thus be further compressed efficiently by DWT. In addition, several techniques, including the null map, IMS smoothing, and global thresholding, have been developed based on the characteristics of the IMS to improve the accuracy of the DAWA algorithm. The essence of DCT or DWT lies in the effectiveness of the power concentration of the transformations. Basically, brown noise and random walks are the prevalent forms of real signals [23]. For brown noise, the energy is concentrated in the low frequencies, since the data points are more correlated with each other than those in white noise generated by a purely random process. Note that real signals have a skewed energy spectrum [8]. For example, stock movements and exchange rates can be successfully modeled as brown noise, which exhibits an energy spectrum of $O(f^{-2})$. Therefore, the DCT coefficients of such data, referred to as the amplitudes of the signals, can be modeled as $O(f^{-1})$ [8]; that is, the amplitude is inversely proportional to the frequency. A scenario of the brown noise assumption for cube streams is shown in Fig. 4, where it can be seen that the amplitudes are approximately inversely proportional to the frequencies. Also, the relationship between power density and frequencies can be explored mathematically to provide a theoretical foundation for zonal sampling. Consequently, we have the following ...
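Putting the two phases together, the following rough sketch (not the paper's implementation; the 4x4 cell size, the coefficient budget, and a single-level Haar DWT are assumptions, and the null map, IMS smoothing, and global thresholding refinements are omitted) illustrates how a PB could be turned into a snapshot: each DAWA cell is DCT-transformed under reciprocal zonal sampling, coefficients sharing a frequency index are regrouped into an IMS, and each IMS is then wavelet-compressed.

```python
import numpy as np
from scipy.fft import dctn

def dawa_snapshot_sketch(pb, cell=4, b=4):
    # Reciprocal zonal sampling: keep frequency indices u with (u1+1)(u2+1) <= b.
    kept_idx = [u for u in np.ndindex((cell, cell))
                if (u[0] + 1) * (u[1] + 1) <= b]
    # Phase 1 (DCT phase): partition the PB into cell x cell DAWA cells and
    # keep only the sampled DCT coefficients of each cell.
    cells = {}
    n0, n1 = pb.shape
    for i in range(0, n0, cell):
        for j in range(0, n1, cell):
            c = dctn(pb[i:i + cell, j:j + cell], norm="ortho")
            cells[(i, j)] = {u: c[u] for u in kept_idx}
    # Phase 2 (DWT phase): regroup coefficients with the same frequency index
    # across cells into an iso-frequency MDS (IMS) and apply a one-level Haar DWT.
    snapshot = {}
    for u in kept_idx:
        ims = np.array([cells[k][u] for k in sorted(cells)])
        pairs = ims.reshape(-1, 2)
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # Haar approximation coefficients
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # Haar detail coefficients (thresholdable)
        snapshot[u] = (approx, detail)
    return snapshot

pb = np.cumsum(np.random.default_rng(1).standard_normal((16, 16)), axis=0)
snap = dawa_snapshot_sketch(pb)
print(f"{len(snap)} iso-frequency groups in the snapshot")
```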
Context 2
... can be successfully modeled as brown noise, which exhibits an energy spectrum of $O(f^{-2})$. Therefore, the DCT coefficients of such data, referred to as the amplitudes of the signals, can be modeled as $O(f^{-1})$ [8]; that is, the amplitude is inversely proportional to the frequency. A scenario of the brown noise assumption for cube streams is shown in Fig. 4, where it can be seen that the amplitudes are approximately inversely proportional to the ...
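The brown noise assumption above can be checked with a small, self-contained script (illustrative only, not taken from the paper): generate a random walk, take its DCT, and fit the log-log slope of the amplitude spectrum; a slope near -1 corresponds to $O(f^{-1})$ amplitudes and hence an $O(f^{-2})$ energy spectrum, as in Fig. 4.

```python
import numpy as np
from scipy.fft import dct

# Brown noise (a random walk) and its DCT amplitude spectrum.
rng = np.random.default_rng(2)
walk = np.cumsum(rng.standard_normal(500))      # e.g. a 500-step index-like series

amps = np.abs(dct(walk, norm="ortho"))
freqs = np.arange(1, len(amps))                 # skip the DC (zero-frequency) term

# Fit log|amplitude| against log(frequency); a slope near -1 matches the
# O(f^-1) amplitude / O(f^-2) energy behaviour assumed for brown noise.
slope, _ = np.polyfit(np.log(freqs), np.log(amps[1:] + 1e-12), 1)
print(f"fitted spectral slope ~ {slope:.2f} (brown noise predicts about -1)")
```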
Context 3
... shown in Fig. 14, the quality of answers from the DAWA algorithm is much better than that from DWT. The error rates of the answers from DAWA are all below 5 percent for both D-TPC and D-TEL, showing that DAWA works well with a small working buffer and is able to generate very small snapshots of the original data sets with acceptable error rates. In ...

Similar publications

Article
Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates, avoiding the high cost of joining large tables. Therefore, mechanisms to provide efficient query processing...
Conference Paper
Bulk loading is used to efficiently build a table or access structure if a large data set is available at index time, e.g., the spool process of a data warehouse or the creation of intermediate results during query processing. The authors introduce the TempTris algorithm that creates a multidimensional partitioning from a one-dimensionally sorted s...
Article
In this paper, we present BMQ-Processor, a high-performance border-crossing event (BCE) detection framework for large-scale monitoring applications. We first characterize a new query semantics, namely, border monitoring query (BMQ), which is useful for BCE detection in many monitoring applications. It monitors the values of data streams and reports...

Citations

... We are not the first to recognize the importance of sampling to accelerate data mining computations or query evaluation. Some references include [5] to speed up K-means clustering, [4, 37] to accelerate data mining algorithms, [24] to efficiently compute range queries, and [2, 23] to accelerate aggregation queries. Association rules are mined in one pass with a sample and refined on a second pass over the entire data set [37]. ...
Article
User-Defined Functions (UDFs) represent an extensibility mechanism provided by most DBMSs, whose execution happens in main memory. Also, UDFs leverage the DBMS multi-threaded capabilities and exploit the C language speed and flexibility for mathematical computations. In this article, we study how to accelerate computation of sufficient statistics on large data sets with UDFs exploiting caching and sampling techniques. We present an aggregate UDF computing multidimensional sufficient statistics that benefit a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory. Otherwise, sampling is required to accelerate processing of very large data sets. Also, sampling can be applied on data sets that can be cached, to further accelerate processing. Experiments carefully analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS and C++ reading flat files, running on the same hardware. We show UDFs can have similar performance to C++, even if both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show geometric and bootstrapping sampling techniques can be faster than performing full tables scans, providing high accuracy estimation of mean, variance and correlation. Even further, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows the UDF is slowed down by I/O when reading from disk.
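As a hedged illustration of the sufficient statistics this abstract refers to (a sketch under assumed names, not the article's UDF code), the linear sum L of points and the quadratic sum of cross-products Q can be computed over a uniform sample and then turned into means, variances, and correlations:

```python
import numpy as np

def sufficient_statistics(X):
    """n, L (linear sum of points), and Q (sum of cross-products of point dimensions)."""
    n = X.shape[0]
    L = X.sum(axis=0)      # d-vector
    Q = X.T @ X            # d x d matrix
    return n, L, Q

rng = np.random.default_rng(3)
X = rng.standard_normal((100_000, 4))                        # stand-in for a large data set
sample = X[rng.choice(len(X), size=10_000, replace=False)]   # simple sampling stand-in, not the article's schemes
n, L, Q = sufficient_statistics(sample)

mean = L / n
cov = Q / n - np.outer(mean, mean)                           # mean/variance/correlation follow from n, L, Q
corr = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))
print(mean.round(3), np.diag(cov).round(3))
```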
Chapter
As the amount of data inside and outside an enterprise grows rapidly, it is becoming important to analyze it seamlessly for total business intelligence. The data can be classified into two categories, structured and unstructured, and both must be analyzed together to obtain total business intelligence. In particular, since most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works selected to demonstrate how text mining and information retrieval technologies can be applied to the multidimensional analysis of text documents, because these are the major technologies for handling text data. We then survey representative works selected to demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect that the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, will provide a better platform toward total business intelligence.
Article
Stream cube computing is an important foundation of multidimensional analysis over data streams. However, the characteristics of data streams (dynamic, infinite, bursty, etc.) and the complexity of the multidimensional data structure pose great challenges in terms of storage space, updating efficiency, adaptability, and so on. In many applications, users often focus on only a portion of the views. A computing method based on an interesting view subset is proposed in this paper. The interesting view subset and interesting paths can be obtained from the information of historical queries, and they should be updated over time if the efficiency of answering queries decreases. The Stream-Tree structure is defined for maintaining the cells of the interesting view subset and drilling paths in memory. In the running phase, the cells of the Stream-Tree are continuously updated as new tuples arrive, and old cells are deleted periodically according to the constraints of multi-level time windows. Sparse cells of the Stream-Tree are not divided into finer ones; only the high-level aggregations are preserved. Experiments and analysis indicate that the method is efficient in maintaining the stream cube cells of the current time window in finite memory and can answer user queries quickly.
Chapter
The paper deals with indexing of a complex-type data stream stored in a database. We present a novel indexing schema and framework referred to as ReTIn (Real-Time Indexing), whose objective is to allow indexing of complex data arriving as a stream into a database, subject to soft real-time constraints, met with some level of confidence, on the maximum duration of insert and select operations. The idea of ReTIn is to combine sequential access to the most recent data with index-based access to less recent data stored in the database. The collection of statistics makes the balancing of the indexed and unindexed parts of the database efficient. We have implemented ReTIn using the PostgreSQL DBMS and its GIN index. Experimental results presented in the paper demonstrate some properties and advantages of our approach.
Conference Paper
We present a new Stream OLAP framework to approximately answer queries on historical stream data, in which each cell is extended from a single value to a synopsis structure. The cell synopses can be constructed by existing, well-researched methods, including Fourier, DCT, wavelet, and PLA. To implement the cube aggregation operation, we develop algorithms that aggregate multiple lower-level synopses into a single higher-level synopsis for those synopsis methods. Our experiments compare all of the synopsis methods used and confirm that the synopsis cells can be accurately aggregated to a higher level.