Book

Indexing Techniques for Advanced Database Systems


Abstract

Recent years have seen an explosive growth in the use of new database applications such as CAD/CAM systems, spatial information systems, and multimedia information systems. The needs of these applications are far more complex than those of traditional business applications. They call for support of objects with complex data types, such as images and spatial objects, and for support of objects with widely varying numbers of index terms, such as documents. Traditional indexing techniques such as the B-tree and its variants do not efficiently support these applications, and so new indexing mechanisms have been developed. As a result of the demand for database support for new applications, there has been a proliferation of new indexing techniques. The need for a book addressing indexing problems in advanced applications is evident. For practitioners and database and application developers, this book explains best practice, guiding the selection of appropriate indexes for each application. For researchers, this book provides a foundation for the development of new and more robust indexes. For newcomers, this book is an overview of the wide range of advanced indexing techniques. Indexing Techniques for Advanced Database Systems is suitable as a secondary text for a graduate-level course on indexing techniques, and as a reference for researchers and practitioners in industry.

Chapters (6)

There has been a growing acceptance of the object-oriented data model as the basis of next-generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products in both categories of DBMS are available today. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
Many applications (such as computer-aided design (CAD), geographic information systems (GIS), computational geometry and computer vision) operate on spatial data. Generally speaking, spatial data are associated with spatial coordinates and extents, and include points, lines, polygons and volumetric objects.
Images have always been an essential and effective medium for presenting visual data. With advances in today’s computer technologies, it is not surprising that in many applications, much of the data is images. In medical applications, images such as X-rays, magnetic resonance images and computer tomography images are frequently generated and used to support clinical decision making. In geographic information systems, maps, satellite images, demographics and even tourist information are often processed, analyzed and archived. In police department criminal databases, images like fingerprints and pictures of criminals are kept to facilitate identification of suspects. Even in offices, information may arrive in many different forms (memos, documents, and faxes) that can be digitized electronically and stored as images.
Apart from some primary keys and keys that rarely change, many attributes evolve and take new values over time. For example, in an employee relation, employees' titles may change as they take on new responsibilities, as will their salaries as a result of promotions or increments. Traditionally, when data is updated, its old copy is discarded and only the most recent version is captured. Conventional databases that have been designed to capture only the most recent data are known as snapshot databases. With the increasing awareness of the value of the history of data, maintenance of old versions of records becomes an important feature of database systems.
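To make the snapshot-versus-history distinction concrete, here is a minimal, hypothetical Python sketch of a relation that keeps every old version of a record with its validity interval instead of overwriting it; the class and attribute names are invented for illustration and do not correspond to any particular system described in the book.

```python
from datetime import date

# Minimal sketch of a "temporal" employee relation: instead of overwriting a
# record (snapshot behaviour), each update closes the current version and
# appends a new one with its own valid-time interval.
MAX_DATE = date(9999, 12, 31)

class VersionedRelation:
    def __init__(self):
        self.rows = []  # (key, value, valid_from, valid_to)

    def update(self, key, value, as_of):
        # Close the currently open version of this key, if any.
        for i, (k, v, start, end) in enumerate(self.rows):
            if k == key and end == MAX_DATE:
                self.rows[i] = (k, v, start, as_of)
        # Append the new version, open-ended until the next change.
        self.rows.append((key, value, as_of, MAX_DATE))

    def value_at(self, key, when):
        # Return the value that was valid on the given date.
        for k, v, start, end in self.rows:
            if k == key and start <= when < end:
                return v
        return None

emp = VersionedRelation()
emp.update("alice", ("Engineer", 50000), date(2018, 1, 1))
emp.update("alice", ("Senior Engineer", 65000), date(2021, 6, 1))
print(emp.value_at("alice", date(2020, 3, 1)))   # ('Engineer', 50000)
print(emp.value_at("alice", date(2022, 1, 1)))   # ('Senior Engineer', 65000)
```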
Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management.
Because performance is a crucial issue in database systems, indexing techniques have always been an area of intense research and development. Advances in indexing techniques are primarily driven from the need to support different data models, such as the object-oriented data model, and different data types, such as image and text data. However, advances in computer architectures may also require significant extensions to traditional indexing techniques. Such extensions are required to fully exploit the performance potential of new architectures, such as in the case of parallel architectures, or to cope with limited computing resources, such as in the case of mobile computing systems. New application areas also play an important role in dictating extensions to indexing techniques and in offering wider contexts in which traditional techniques can be used.
... The classical spatial approaches in indexing often tend to linearise the data so as to use known "fast" structures. Such is the case for quadtrees, kd-trees (Samet, 1984; Ooi, 1997) or other methods for spatial objects, or more accurately spatial points. Kd-trees are related to binary trees. ...
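As a rough illustration of the kd-tree idea mentioned in this excerpt (alternating one-dimensional splits over spatial points, closely related to binary search trees), the following Python sketch builds a small 2-d kd-tree and answers a window query; it is a didactic toy, not the cited authors' implementation.

```python
# Minimal 2-d kd-tree sketch (illustrative only): points are split on
# alternating coordinates, which is what makes the structure a close
# relative of the binary search tree.
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % 2                      # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median keeps the tree balanced
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def range_search(node, lo, hi, out):
    # Collect all points inside the axis-aligned rectangle [lo, hi].
    if node is None:
        return
    x, a = node.point, node.axis
    if all(lo[i] <= x[i] <= hi[i] for i in range(2)):
        out.append(x)
    if lo[a] <= x[a]:
        range_search(node.left, lo, hi, out)
    if x[a] <= hi[a]:
        range_search(node.right, lo, hi, out)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = []
range_search(tree, (3, 1), (8, 5), hits)
print(hits)   # points falling inside the query window
```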
... The bounding rectangles can then be regrouped within bigger rectangles so as to create a balanced tree. The R-tree and its sibling the R*-tree (Ooi, 1997) are examples of this. While R-trees allow working with complex objects (approximated as rectangles rather than points), their higher building and querying times make the use of lighter structures appealing for indexing points. ...
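The grouping of object bounding rectangles under larger parent rectangles can be sketched as follows; the naive sort-by-x grouping below merely illustrates the principle and is not the actual R-tree or R*-tree insertion algorithm.

```python
# Sketch of the idea behind R-tree nodes: object MBRs (minimum bounding
# rectangles) are grouped, and each group is covered by a bigger rectangle
# stored one level up.
def mbr(rects):
    # Smallest rectangle enclosing all given (xmin, ymin, xmax, ymax) rects.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def group_rects(rects, fanout=3):
    rects = sorted(rects, key=lambda r: r[0])        # naive spatial ordering
    groups = [rects[i:i + fanout] for i in range(0, len(rects), fanout)]
    return [(mbr(g), g) for g in groups]             # (parent MBR, children)

objects = [(0, 0, 1, 1), (2, 1, 3, 4), (1, 1, 2, 2),
           (8, 8, 9, 9), (7, 6, 8, 7), (9, 5, 10, 6)]
for parent, children in group_rects(objects):
    print(parent, "covers", children)
```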
... Regarding the temporal approach, it is important to note that different notions of time can be used in databases (Ooi, 1997). Transaction time allows users to perform "rollbacks" so as to find past values. ...
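A minimal, assumed-for-illustration sketch of transaction time: writes are appended with commit timestamps and never overwritten, so the state can be rolled back to any earlier moment with an "as of" reconstruction.

```python
import time

# Illustrative transaction-time store: every write is appended with its
# commit timestamp and nothing is overwritten, so an "as of" query can
# roll the table back to any earlier point.
class TransactionTimeStore:
    def __init__(self):
        self.log = []                      # (commit_ts, key, value)

    def put(self, key, value, commit_ts=None):
        self.log.append((commit_ts if commit_ts is not None else time.time(),
                         key, value))

    def as_of(self, ts):
        # Reconstruct the state the database had at transaction time ts.
        state = {}
        for commit_ts, key, value in self.log:
            if commit_ts <= ts:
                state[key] = value
        return state

db = TransactionTimeStore()
db.put("salary:alice", 50000, commit_ts=100)
db.put("salary:alice", 65000, commit_ts=200)
print(db.as_of(150))   # {'salary:alice': 50000} -- the rolled-back value
print(db.as_of(250))   # {'salary:alice': 65000}
```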
... As modern databases increasingly integrate various types of information, such as multimedia data, it becomes necessary to support efficient retrieval in such systems. Examples of such applications include multimedia information systems [9], CAD/CAM [4], geographical information systems (GIS) [7], time-series databases [10], and medical imaging [8]. The data is usually represented by a feature vector which summarizes the original data with some number of dimensions. The representation of multidimensional point data is a central issue in database design, as well as in applications in many other fields, including computer graphics, computer vision, computational geometry, image processing, geographic information systems (GIS), pattern recognition, very large scale integration (VLSI) design, and others. ...
... A very popular and effective technique employed to overcome the curse of dimensionality is the Vector Approximation File (VA-File) [4]. In the VA-File, the space is partitioned into a number of hyper-rectangular cells, which approximate the data that reside inside the cells. ...
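The following Python toy (with invented resolutions and data) illustrates the VA-File principle described here: quantising each dimension into a few bits yields a cell approximation per point, and a lower-bound distance to each cell is used to filter candidates before the exact vectors are examined.

```python
# Rough sketch of the VA-File idea: each dimension is quantised into a few
# bits, so every point gets a compact cell approximation that is scanned
# first to filter candidates before touching the exact vectors.
def approximate(point, bits=2, lo=0.0, hi=1.0):
    cells = 2 ** bits
    return tuple(min(cells - 1, int((x - lo) / (hi - lo) * cells)) for x in point)

def cell_bounds(cell, bits=2, lo=0.0, hi=1.0):
    width = (hi - lo) / (2 ** bits)
    return [(lo + c * width, lo + (c + 1) * width) for c in cell]

def min_dist_to_cell(query, cell, bits=2):
    # Lower bound of the distance from the query to any point in the cell.
    d = 0.0
    for q, (cl, ch) in zip(query, cell_bounds(cell, bits)):
        if q < cl:
            d += (cl - q) ** 2
        elif q > ch:
            d += (q - ch) ** 2
    return d ** 0.5

data = [(0.1, 0.2), (0.8, 0.9), (0.45, 0.5)]
va_file = [approximate(p) for p in data]       # compact approximations
query = (0.15, 0.25)
# Filtering step: rank cells by their lower-bound distance to the query.
print(sorted(zip(va_file, data), key=lambda e: min_dist_to_cell(query, e[0])))
```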
... Thus, we again stress that the notion of a cluster is imprecise, and the best definition depends on the type of data and the desired results. Indexes [4] such as R-trees [11] have been shown to be inefficient even for supporting range/window queries in high-dimensional databases; they do, however, form the basis for indexes designed for high-dimensional databases [12,17]. To reduce the effect of high dimensionality, the use of bigger nodes [3], dimensionality reduction [7] and filter-and-refine methods [6] have been proposed. ...
Article
We consider approaches for similarity search in correlated, high-dimensional data-sets, which are derived within a clustering framework. We note that indexing by "vector approximation" (VA-File), which was proposed as a technique to combat the "curse of dimensionality", employs scalar quantization and hence necessarily ignores dependencies across dimensions, which represents a source of suboptimality. Clustering, on the other hand, exploits inter-dimensional correlations and is thus a more compact representation of the data-set. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. Recently, a vector approximation based technique called the VA-File has been proposed for indexing high-dimensional data. It has been shown that the VA-File is an effective technique compared to the current approaches based on space and data partitioning, and it gives good performance especially when the data set is uniformly distributed. Real data sets, however, are not uniformly distributed, are often clustered, and the dimensions of their feature vectors are usually correlated. Compared with the VA-File, over a wide range of quantization resolutions, the clustering-based approach is able to reduce random IO accesses, given (roughly) the same amount of sequential IO operations, by factors reaching 100X and more.
... We then obtain a hierarchy. Among these trees, one can cite the R-tree (Guttman, 1984), the R* and R+-trees, the BV-tree (Bertino, 1997), the DR-tree (Lee, 2001), the Pyramid-tree, the SS and SR-trees, the TV-tree, the X-tree (Berchtold, 1996), the XBR-tree (Vassilakopoulos, 1999), etc. ...
... The architectures discussed above are not the only existing ones. Indeed, one can also cite dynamic hashing methods (Grid File, Plop Hashing, etc.), as well as methods based on quadtrees, or on octrees for three-dimensional space (Bertino, 1997). ...
... Within this last category, one can cite the Append-Only tree, the B+-tree with linear order, the Interval B-tree, the Multi-Version B-tree, the NST-tree, the Time Index, and the TSB-tree (Bertino, 1997). We focus more particularly on the Append-Only tree. ...
Article
Full-text available
Sensor database systems are increasingly used for monitoring hazardous environments. These systems usually consist of an array of sensors sending their measurements to a central database. The measurement frequency as well as the user requirements impose soft real-time constraints on the database and tend to put the focus on the newest data. The widespread need to access the data through spatial criteria adds spatio-temporal specificities. To meet these requirements, this paper proposes two data access methods. The first one, dedicated to indexing the large number of measurements coming from fixed sensors, is named the PoTree. Its evolution, the PasTree, focuses on managing sensor agility and on extending the querying methods.
... The spatial approaches in indexing often tend to linearize the data so as to use known "fast" structures. Such is the case for quadtrees, kd-trees (Ooi, Tan, 1997) or other methods for spatial objects, or more accurately spatial points. Kd-trees are related to binary trees. ...
... The bounding rectangles can then be regrouped within bigger rectangles so as to create a balanced tree. The R-tree and its sibling the R*-tree (Ooi, Tan, 1997) are examples of this. While R-trees allow working with complex objects (approximated as rectangles rather than points), their higher building and querying times make the use of lighter structures appealing for indexing points. ...
... For the temporal approach, it is important to note that different notions of time can be used in databases (Ooi, Tan, 1997). Transaction time allows users to perform "rollbacks" so as to find past values. ...
Article
Full-text available
The goal of this paper is to underline the importance of real-time systems for managing information during the phase of disaster monitoring. We stress the importance of soft real-time GIS, and we present a list of barriers to overcome in order to get this kind of system. Among the barriers, we present a solution for real-time indexing of spatio-temporal data based on a data structure named PO-Tree.
... Access to an image is obtained through the description that accompanies it. Several authors discuss the limitations of these systems in their work [32, 33, 14, 19, 4, 27, 16]. Currently, work is being done on methodologies that analyze queries expressed in natural language [30] as a way to facilitate searches without the drawbacks mentioned above. ...
... Elisa Bertino and colleagues [4] have proposed an architecture model for an Image Database system. This architecture, shown in Figure 1, reflects the modules and functionality of these systems. ...
... The query module is the one that allows the user to perform searches, using the different types of query methods available. In the model presented by Bertino et al. [4], feature extraction is assumed to take place in the query module, which is why it can be said to be a model oriented toward querying by an example or a sketch of the image. The query process can be carried out in several ways: 1. Using an example image to ask the system for similar images. ...
Article
Full-text available
Abstract: This paper first presents a brief review of image databases. The reasons for their origin and evolution are outlined, grouping them into two categories according to their functionality. The different types of image databases are then examined, briefly commenting on their characteristics and drawing a division between systems considered simple and Content-Based Image Retrieval Systems (CBIRS). Subsequently, an architecture due to Bertino is presented, briefly describing the different modules and the main processes that CBIRS incorporate. Other architectures that have recently appeared in the literature are also briefly discussed. A new architecture is proposed, more general than the previous ones, which accommodates the different query methods used in current CBIRS. The operation of the different modules of this architecture is explained, both in the population phase and in the query phase, together with its main stages and the possible data paths that the information follows depending on the query method employed. Keywords: image databases, image database architectures, Content-Based Image Retrieval Systems, image processing.
... First, we add a Time-span entity specifying the interval of existence between two dates in the real world for each Geospatial Feature to note the feature's valid time (e.g. a building built in 1975 and destroyed in 2012). Spatial Nodes can also be associated with a Time-span to use spatio-temporal indexing methods (Theodoridis et al. 1998, Hadjieleftheriou et al. 2002, Mokbel et al. 2003) or temporal indexing methods (Bertino et al. 1997). While organization of spatio-temporal features using a spatial indexing method is easy to implement, it may not be optimal. ...
... Geospatial web delivery standards allow data to be organized according to spatial indexing methods (Azri et al. 2013). The temporal extension proposed in this paper makes it possible to use spatio-temporal indexing methods (Theodoridis et al. 1998, Hadjieleftheriou et al. 2002, Mokbel et al. 2003) or temporal indexing methods (Bertino et al. 1997). An interesting challenge would then be to study the impact of these indexing methods on querying and visualizing 3D tiles with 3DTiles_temporal extension datasets. ...
Article
Full-text available
Studying and planning urban evolution is essential for understanding the past and designing the cities of the future and can be facilitated by providing means for sharing, visualizing, and navigating in cities, on the web, in space and in time. Standard formats, methods, and tools exist for visualizing large-scale 3D cities on the web. In this paper, we go further by integrating the temporal dimension of cities in geospatial web delivery standard formats. In doing so, we enable interactive visualization of large-scale time-evolving 3D city models on the web. A key characteristic of this paper lies in the proposed four-step generic approach. First, we design a generic conceptual model of standard formats for delivering 3D cities on the web. Then, we formalize and integrate the temporal dimension of cities into this generic conceptual model. Following which, we specify the conceptual model in the 3D Tiles standard at logical and technical specification levels, resulting in an extension of 3D Tiles for delivering time-evolving 3D city models on the web. Finally, we propose an open-source implementation, experiments, and an evaluation of the propositions and visualization rules. We also provide access to reproducibility notes allowing researchers to replicate all the experiments.
... The features of the data models are also compared. The importance of indexing (Bertino et al., 2012) in databases has been well established for a long time, particularly when the data size is huge. Indexing in Big Data is comparatively a challenge due to several reasons. ...
... Indexes support the efficient execution of frequently used queries, especially for read operations (Bertino et al., 2012). An index may include one or more columns, and sometimes the size of an index may grow larger than the table it is created for, but it eventually provides rapid lookup and fast access to the data, and this compensates for the overhead of having indexes. ...
Article
Full-text available
Today, in data science, the term Big Data has attracted a large audience from various research fields and industries who see the potential of Big Data in solving complex problems. In this age, decision-making processes are largely data dependent. Though the concept of Big Data is in the midst of its evolution, with great research and business opportunities, the challenges are enormous and growing equally, from data collection up to decision making. This motivates various scientific disciplines to combine their efforts for deep exploration of all dimensions of Big Data to procure evolutionary outcomes. The considerable velocity of the volume expansion and the variety of the data pose serious challenges to the existing data processing systems. Especially in the last few years, the volume of data has grown manyfold. Data storage has been inundated by various disparate data outlets, led by social media such as Facebook, Twitter, etc. The existing data models are largely unable to illuminate the full potential of Big Data; the information that may serve as the key solution to several complex problems is left unexplored. The existing computation capacity falls short of the increasingly expanded storage capacity. The fast-paced volume expansion of unorganized data entails a complete paradigm shift in new-age data computation and is driving the evolution of new capable data engineering techniques such as capture, curation, visualization, and analysis. In this paper, we provide a first-level classification for modern Big Data models. Some of the leading representatives of each class that claim to best process Big Data in a reliable and efficient way are also discussed. The classification is further strengthened by intra-class and inter-class comparisons and discussions of the Big Data models under consideration.
... In OQL, path expressions are defined as a chain of objects and methods/attributes in the so-called object composition graph [15] (a.k.a. aggregation graph [16]). In the object composition graph, an object o1 has a directed edge to another object o2 if and only if o1 has an attribute or method whose value or result is in the class of object o2. ...
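A small illustrative sketch of such an object composition (aggregation) graph, built from a made-up schema: an edge goes from class c1 to class c2 whenever an attribute of c1 is typed by c2, and a path expression simply follows these edges.

```python
# Toy construction of an object composition (aggregation) graph: class c1
# points to class c2 whenever some attribute of c1 is typed by c2.
# The schema below is invented purely for illustration.
schema = {
    "Department": {"name": "str", "manager": "Employee"},
    "Employee":   {"name": "str", "address": "Address", "dept": "Department"},
    "Address":    {"city": "str", "zip": "str"},
}

def composition_graph(schema):
    edges = set()
    for cls, attrs in schema.items():
        for attr, target in attrs.items():
            if target in schema:           # attribute typed by another class
                edges.add((cls, attr, target))
    return edges

for src, attr, dst in sorted(composition_graph(schema)):
    print(f"{src} --{attr}--> {dst}")
# A path expression such as Department.manager.address.city simply follows
# these edges from class to class.
```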
... Another optimization technique is using an indexed scan on the paths when an index is available. Optimizations based on indexes are beyond the scope of this paper, and we refer the interested readers to [16]. Join-based evaluation of XPath path expressions is possible, and we discuss it along with an approach based on tree pattern matching (TPM) and a hybrid approach combining the two. ...
Article
Path expressions are ubiquitous in XML processing languages such as XPath, XQuery, and XSLT. Expressions in these languages typically include multiple path expressions, some of them correlated. Existing approaches evaluate these path expressions one at a time and miss the optimization opportunities that may be gained by exploiting the correlations among them. In this paper, we address the evaluation and optimization of correlated path expressions. In particular, we propose two types of optimization techniques: integrating correlated path expressions into a single pattern graph, and rewriting the pattern graph according to a set of rewriting rules. The first optimization technique allows the query optimizer to choose an execution plan that is impossible with the existing approaches. The second optimization technique rewrites pattern graphs at a logical level and produces a set of equivalent pattern graphs from which a physical optimizer can choose given an appropriate cost function. Under certain conditions that we identify, the graph pattern matching-based execution approach that we propose may be more efficient than the join-based approaches.
... B+ Tree indexes have been broadly used in data-heavy systems to ease query retrieval. They are widely adopted because the B+ Tree is a height-balanced tree [13]. In more detail, each path from the root of the tree to a leaf of the tree is the same length. ...
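To illustrate the height-balance property, the toy bulk loader below builds a B+-tree-like structure level by level from sorted keys and then checks that every leaf sits at the same depth; it is a simplification (no insertion or node splitting), not a full B+-tree.

```python
# Sketch of why a B+-tree keeps every root-to-leaf path the same length:
# the tree is built level by level (here by bulk-loading sorted keys), so
# all leaves end up at the same depth.
FANOUT = 4

def bulk_load(sorted_keys):
    # Leaf level: chunks of keys.
    level = [{"keys": sorted_keys[i:i + FANOUT], "children": None}
             for i in range(0, len(sorted_keys), FANOUT)]
    # Build internal levels until a single root remains.
    while len(level) > 1:
        level = [{"keys": [c["keys"][0] for c in level[i:i + FANOUT]],
                  "children": level[i:i + FANOUT]}
                 for i in range(0, len(level), FANOUT)]
    return level[0]

def leaf_depths(node, depth=0):
    if node["children"] is None:
        return [depth]
    return [d for c in node["children"] for d in leaf_depths(c, depth + 1)]

root = bulk_load(list(range(100)))
print(set(leaf_depths(root)))   # a single value: every leaf sits at the same depth
```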
Conference Paper
Full-text available
The relevance of this paper to the topics of the conference is that this paper provides a contribution to the area of systems communications, since this paper participates in the improvement of the Wireless Response Systems (WRS). In more detail, the rapid development of computer and wireless technologies improves many aspects of daily life. The objective of this research is to develop a database for the WRS in order to gain an efficient, fast, and reliable database management system. Furthermore, this research proposes a generic database structure for the Wireless Response System. Moreover, it investigates and studies advanced database indexing techniques and then performs a comparison between them. Subsequently, this work makes an argument to find out the most appropriate indexing technique for the WRS. Consequently, this research has achieved a great deal of success and has met the objectives and aims. To conclude, a framework for the Wireless Response System database has been developed. Additionally, the B+ Tree and Hash indexing techniques have been examined successfully. Thus, it is found that the B+ Tree is a powerful technique for this particular system.
... Relaxing the goal of finding exactly the nearest neighbors allows a nearest-neighbor search to run significantly faster. Many of those methods are based on indexing (Bertino et al., 1997), constructing a multi-dimensional index structure that provides a mapping between a query sample and the ordering on the clustered index values, speeding up the search for neighbors. ...
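A hedged sketch of this index-assisted approximate search: points are grouped by a tiny k-means, and a query scans only the clusters whose centroids are closest instead of the whole data set. The parameters (k, number of probed clusters) are arbitrary illustrations, not values from the cited work.

```python
import random, math

# Illustrative approximate nearest-neighbour search: cluster the points,
# then probe only the clusters nearest to the query.
def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [tuple(sum(c) / len(b) for c in zip(*b)) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids, buckets

def approx_nn(query, centroids, buckets, probe=2):
    # Scan only the `probe` clusters nearest to the query.
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [p for i in order[:probe] for p in buckets[i]]
    return min(candidates, key=lambda p: dist(query, p))

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
centroids, buckets = kmeans(pts, k=10)
print(approx_nn((0.5, 0.5), centroids, buckets))
```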
Article
Full-text available
The k-nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been subject of research for many years and, among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned out as strengths and the k-nearest neighbors rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbor algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis in a series of big datasets that provide guidelines as to how to use the k-nearest neighbor algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed including all the Smart Data algorithms analyzed. This article is categorized under: Technologies > Data Preprocessing; Fundamental Concepts of Data and Knowledge > Big Data Mining; Technologies > Classification.
... The distinguished properties of graph databases include graph storage and graph processing. Some graph databases offer native graph storage, while others serialize graph data into a general-purpose database such as a relational database [25], an object-oriented database [26], or a NoSQL store [27] (other than a graph store). The approach used by graph databases in which adjacent nodes directly point to each other is termed index-free adjacency. ...
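Index-free adjacency can be illustrated with a few lines of Python in which each node object holds direct references to its neighbours, so traversal follows pointers rather than consulting a global edge index; this is a conceptual toy, not any product's storage format.

```python
# Toy illustration of index-free adjacency: each node keeps direct
# references to its neighbours, so traversal never consults an external
# edge index.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []        # direct references to adjacent nodes

    def connect(self, other):
        self.neighbours.append(other)

def neighbours_within(start, hops):
    # Breadth-first traversal that only follows node-to-node pointers.
    seen, frontier = {start}, [start]
    for _ in range(hops):
        frontier = [n for node in frontier for n in node.neighbours if n not in seen]
        seen.update(frontier)
    return {n.name for n in seen}

a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
a.connect(b); b.connect(c); c.connect(d)
print(neighbours_within(a, 2))   # {'A', 'B', 'C'}
```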
Article
Full-text available
Graph is an expressive way to represent dynamic and complex relationships in highly connected data. In today's highly connected world, general-purpose graph databases are providing opportunities to experience the benefits of semantically significant networks without investing in graph infrastructure. Examples of prominent graph databases are Neo4j, Titan, and OrientDB. In the biological OMICS landscape, Interactomics is one of the new disciplines that focuses mainly on the data modeling, data storage, and retrieval of biological interaction data. Biological experiments generate a prodigious amount of data in various formats (semi-structured or unstructured). The large volume of such data poses challenges for data acquisition, data integration, multiple data modalities (either data model or storage model), storage, processing, and visualization. This paper aims at designing a well-suited graph data storage model for biological information collected from major heterogeneous biological data repositories, by using a graph database.
... Nowadays, all database management systems (DBMS) use B-tree indexes or hashing to accelerate data access operations [21]. Using proper indexes on some columns of the table substantially reduces the scope of search. ...
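Using Python's built-in sqlite3 module, one can observe this effect directly: the same query is answered by a full table scan before an index exists and by an index search afterwards. The table and column names below are made up for the demonstration.

```python
import sqlite3

# Demonstration that adding an index changes the access path from a full
# table scan to an index search (SQLite's EXPLAIN QUERY PLAN output).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, dept TEXT, salary INT)")
conn.executemany("INSERT INTO employee (dept, salary) VALUES (?, ?)",
                 [("d%d" % (i % 50), i) for i in range(10000)])

query = "SELECT COUNT(*) FROM employee WHERE dept = 'd7'"
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # full table scan

conn.execute("CREATE INDEX idx_employee_dept ON employee(dept)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # search using the index
```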
... – Linear probing – Indexing (Bertino et al., 2012): to quickly locate data in voluminous datasets ...
Article
Big data is a potential research area receiving considerable attention from academia and IT communities. In the digital world, the amounts of data generated and stored have expanded within a short period of time. Consequently, this fast-growing rate of data has created many challenges. In this paper, we use the structuralism and functionalism paradigms to analyze the origins of big data applications and their current trends. This paper presents a comprehensive discussion of state-of-the-art big data technologies based on batch and stream data processing. Moreover, the strengths and weaknesses of these technologies are analyzed. This study also discusses big data analytics techniques, processing methods, some reported case studies from different vendors, several open research challenges, and the opportunities brought about by big data. The similarities and differences of these techniques and technologies based on important parameters are also investigated. Emerging technologies are recommended as a solution for big data problems.
... Note that we do not build a hash index in our baselines, since we mainly work on relationship tables; each individual column in a relationship table is not a primary key and has many duplicates. For instance, to eliminate a whole-column scan, binary search can be utilized [12, 32] if the column is sorted. Since we adopt the virtual-IDs store strategy, all the columns should be organized in one order. ...
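The binary-search point can be sketched with the standard-library bisect module: if the column values are kept sorted (with a parallel list of row identifiers, an assumption made here for illustration), matching rows are located without scanning the whole column.

```python
import bisect

# Sketch: a sorted column plus a parallel list of row ids lets binary
# search replace a whole-column scan.
sorted_column = [3, 7, 7, 7, 12, 19, 25, 31]      # values, sorted
row_ids       = [5, 0, 2, 9, 1, 4, 8, 3]          # row id of each value

def lookup(value):
    lo = bisect.bisect_left(sorted_column, value)
    hi = bisect.bisect_right(sorted_column, value)
    return row_ids[lo:hi]                          # all rows with that value

print(lookup(7))    # [0, 2, 9] -- found without scanning the whole column
```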
Article
Full-text available
We study a class of graph analytics SQL queries, which we call relationship queries. These queries involving aggregation, join, semijoin, intersection and selection are a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios, where efficient relationship queries are needed. However, row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. We propose a GQ-Fast database, which is an indexed database that roughly corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access. We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2--4 orders of magnitude and MonetDB, Vertica and Neo4j by 1--3 orders of magnitude when all of them are running on RAM. Our experiments dissect GQ-Fast's advantage between (i) the use of compiled code, (ii) the bottom-up pipelining execution strategy, and (iii) the use of dense structures. Other analysis and experiments show the space savings of GQ-Fast due to the appropriate use of compression methods. We also show that the runtime penalty incurred by the dense compression methods decreases as the number of CPU cores increases.
... Hadoop is a typical big data batch computing framework. The HDFS distributed file system is responsible for the storage of static data [10]. The computation logic is assigned to each data node, via MapReduce, for data processing. ...
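The map/shuffle/reduce flow that MapReduce distributes across data nodes can be imitated in a few lines of single-process Python; this word-count toy only shows the shape of the computation, not Hadoop's actual APIs.

```python
from collections import defaultdict

# Minimal, single-process imitation of the MapReduce flow
# (map -> shuffle/group by key -> reduce); the real framework distributes
# these phases across HDFS data nodes.
def map_phase(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    return key, sum(values)

docs = {1: "big data needs big storage", 2: "storage and computing move to data"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
print(dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items()))
```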
Conference Paper
Full-text available
Micro-blogging is becoming an important information source for breaking news events. Since micro-blogs form a real-time unbounded stream with complex relationships, traditional burst event detection techniques do not work well. This paper presents RBEDS, a real-time burst event detection system built on the Storm distributed stream processing framework. A K-Means clustering approach and a burst-feature detection approach are each used to identify candidate burst events, and their outputs are combined to generate the final event detection results. This operation is implemented as a Storm topology. The proposed system is evaluated on a large Sina micro-blogging dataset. The achieved system performance shows that RBEDS can detect burst events with good timeliness, effectiveness and scalability.
... To tackle such types of tasks, distributed processing frameworks, e.g. MapReduce in [5], and database management systems (DBMS) in [12] have proved to be more popular. However, many researchers have shown that traditional DBMSs, which use a 'store-then-process' method of computation, cannot provide the acceptable latency needed in real-time stream processing applications [6]. ...
Article
Full-text available
Twitter is an online service that enables users to read and post tweets, thereby providing a wealth of information regarding breaking news stories. The problem of First Story Detection is to identify the first stories about different events from streaming documents. The Locality Sensitive Hashing algorithm is the traditional approach used for First Story Detection. The documents have a high degree of lexical variation, which makes First Story Detection a very difficult task. This work uses Twitter as the data source to address the problem of real-time First Story Detection. As Twitter data contains a lot of spam, we built a dictionary of words to remove spam from the tweets. Further, since the Twitter streaming data rate is high, we cannot use the traditional Locality Sensitive Hashing algorithm to detect the first stories. We modify the Locality Sensitive Hashing algorithm to overcome this limitation while maintaining reasonable accuracy with improved performance. Also, we use the Storm distributed platform, so that the system benefits from the robustness, scalability and efficiency that this framework offers.
... The golden age of spatio-temporal data management systems around the 1990s led to important data management systems, with data models, query languages, indexing and optimization techniques (Bertino et al. 1997). For example, (Tzouramanis et al. 1999) proposed an access method, overlapping linear quadtrees, to store consecutive historical raster images in a database of evolving images and to support query processing. ...
Article
Full-text available
Pervasive computing is all about making information, data, and services available everywhere and anytime. The explosion of huge amounts of data, largely distributed and produced by different means (sensors, devices, networks, analysis processes, more generally data services), and the requirement to have queries processed on the right information, at the right place, at the right time have led to new research challenges for querying. For example, query processing can be done locally in the car, on PDAs or mobile phones, or it can be delegated to a distant server accessible through the Internet. Data and services can therefore be queried and managed by stationary or nomadic devices, using different networks. The main objective of this chapter is to present a general overview of existing approaches to query processing and the authors' vision of query evaluation in pervasive environments. It illustrates, with scenarios and practical examples, existing data and stream querying systems in pervasive environments. It describes the evaluation process of (i) mobile queries and queries on moving objects, (ii) continuous queries and (iii) stream queries. Finally, the chapter introduces the authors' vision of query processing as a service composition in pervasive environments.
... The importance of indexing (Bertino et al., 2012) in databases has been proven, particularly when the data size is extremely large. It improves searching for desired data in large tables and helps in quickly locating data by bypassing the traversal of each and every row. ...
Article
Full-text available
Today, science is passing through an era of transformation, where the inundation of data, dubbed the data deluge, is influencing the decision-making process. Science is driven by data and is being termed data science. In this internet age, the volume of data has grown to petabytes, and this large, complex, structured or unstructured, and heterogeneous data in the form of "Big Data" has gained significant attention. The rapid pace of data growth through various disparate sources, especially social media such as Facebook, has seriously challenged the data analytic capabilities of traditional relational databases. The velocity of the expansion of the amount of data gives rise to a complete paradigm shift in how new-age data is processed. Confidence in the data engineering of the existing data processing systems is gradually fading, whereas the capabilities of the new techniques for capturing, storing, visualizing, and analyzing data are evolving. In this review paper, we discuss some of the modern Big Data models that are leading contributors in the NoSQL era and claim to address Big Data challenges in reliable and efficient ways. Also, we take the potential of Big Data into consideration and try to reshape the original operation-oriented definition of "Big Science" (Furner, 2003) into a new data-driven definition and rephrase it as "The science that deals with Big Data is Big Science."
... Traditional data-intensive tasks involve the batch processing of large static datasets using networks of multiple machines. To tackle these types of tasks, database management systems (DBMS) [5] and distributed processing frameworks, e.g. MapReduce [7], have proved to be popular. ...
Conference Paper
Social media streams, such as Twitter, have shown themselves to be useful sources of real-time information about what is happening in the world. Automatic detection and tracking of events identified in these streams have a variety of real-world applications, e.g. identifying and automatically reporting road accidents for emergency services. However, to be useful, events need to be identified within the stream with a very low latency. This is challenging due to the high volume of posts within these social streams. In this paper, we propose a novel event detection approach that can both effectively detect events within social streams like Twitter and can scale to thousands of posts every second. Through experimentation on a large Twitter dataset, we show that our approach can process the equivalent to the full Twitter Firehose stream, while maintaining event detection accuracy and outperforming an alternative distributed event detection system.
... Fig. 2: Process of generating the feature vector of a moving point and the feature matrix of trajectory data. ...
Article
Full-text available
In this paper, we propose a method of content-based similarity search over trajectory data sets for assisting users in deciding on their best destination. Recently, many similarity search engines for trajectory data have been proposed. However, the algorithms of these similarity search engines only deal with the positions in the trajectory data. Algorithms that use only physical locations for similarity search are not effective, because most users have interests associated with their movements. We believe that data about users' interests while moving are also essential for calculating similarities between trajectory data and a user's best destination. In this paper, we propose a novel method for calculating similarities of trajectory data using textual metadata. We use descriptive documents of the spots where the user stayed. In our proposed method, we use an average function to integrate the user's trajectory data and use the slope of the similarities. Consequently, we confirmed that the system can calculate accurate similarities for trajectory data. We also confirmed the precision of our proposed method for trajectory data.
... In a sense, the STL is largely a calculus of indexes. There are many varieties of index [16], and each provides a different interface [4, 5, 19, 28]. The STL handles this diversity with subclassing; although a part of this interface is shared by all containers, each associative container subclass provides its own unique methods. ...
Article
In relational database systems the index is generally a second-class construct: users cannot explicitly use an index. (In fact, the keyword INDEX is not even defined in the SQL2 (SQL92) standard.) The principle that 'indexes should be used but not seen' has been followed for decades, and is often justified as necessary in order to avoid the complexities introduced by explicit access paths and navigational queries. We review arguments for and against this principle, and for making the index a first-class construct in relational database systems. For large and complex databases, such as those arising in bioinformatics, the second-class status of indexing can be in conflict with its importance. The case for a first-class index appears strongest for situations like these. We investigate ways to incorporate first-class indexing into the relational database model, surfacing indexes as functionals. This investigation gives insights about the relational model, and also suggests ways for relational databases to support applications like bioinformatics, for which indexing is of central importance.
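One plausible reading of "indexes as functionals" is an index exposed as an explicit callable from key values to matching rows; the small Python sketch below is only an interpretation for illustration, with invented data, not the paper's actual proposal.

```python
# Sketch of an index surfaced as a first-class, callable object mapping a
# key value to the rows that match it, instead of being hidden behind the
# query optimizer. Data and names are invented for the example.
class FirstClassIndex:
    def __init__(self, rows, key):
        self._map = {}
        for row in rows:
            self._map.setdefault(key(row), []).append(row)

    def __call__(self, value):
        # The index is used explicitly, like a function from key to rows.
        return self._map.get(value, [])

genes = [{"id": "g1", "chrom": "chr1"}, {"id": "g2", "chrom": "chr2"},
         {"id": "g3", "chrom": "chr1"}]
by_chrom = FirstClassIndex(genes, key=lambda r: r["chrom"])
print([g["id"] for g in by_chrom("chr1")])   # ['g1', 'g3']
```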
... BitCube [24]). Indexing the structure of XML documents is related to indexing aggregation graphs and inheritance hierarchies in object-oriented databases [4]. Indices on aggregation graphs allow for indexing objects on values of nested objects. ...
Article
To cope with the increasing number and size of XML documents, XML databases provide index structures to accelerate queries on the content and structure of documents. To adapt indices to the query workload, XML databases require various secondary index structures. This paper presents a generic index framework called sciens (Structure and Content Indexing with Extensible, Nestable Structures). In contrast to existing work on XML indexing, this framework can integrate arbitrary index structures and adapt them to different query requirements. It supports defining, accessing and maintaining indices without affecting query and update processing. By offering great flexibility in what to index, the framework allows for processing queries more efficiently.
Book
Full-text available
This book presents the peer-reviewed proceedings of the 2nd International Conference on Computational and Bioengineering (CBE 2020) jointly organized in virtual mode by the Department of Computer Science and the Department of BioScience & Sericulture, Sri Padmavati Mahila Visvavidyalayam (Women's University), Tirupati, Andhra Pradesh, India, during 4–5 December 2020. The book includes the latest research on advanced computational methodologies such as artificial intelligence, data mining and data warehousing, cloud computing, computational intelligence, soft computing, image processing, Internet of things, cognitive computing, wireless networks, social networks, big data analytics, machine learning, network security, computer networks and communications, bioinformatics, biocomputing/biometrics, computational biology, biomaterials, bioengineering, and medical and biomedical informatics.
Article
Frequent pattern mining is an essential data-mining task, with the goal of discovering knowledge in the form of repeated patterns. Many efficient pattern-mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called "Big Data". Scalable parallel algorithms hold the key to solving the problem in this context. This paper reviews recent advances in parallel frequent pattern mining, analysing them through the Big Data lens. Load balancing and work partitioning are the major challenges to be conquered, and they continually call for innovative methods as Big Data evolves without limits. The biggest challenge of all is conquering unstructured data to find frequent patterns. To accomplish this, a Semi-Structured Doc-Model and ranking of patterns are used.
Chapter
In this chapter, the authors discuss two important trends in modern software engineering (SE) regarding the utilization of knowledge management (KM) and information retrieval (IR). Software engineering is a discipline in which knowledge and experience, acquired in the course of many years, play a fundamental role. For software development organizations, the main assets are not manufacturing plants, buildings, and machines, but the knowledge held by their employees. Software engineering has long recognized the need for managing knowledge and that the SE community could learn much from the KM community. The authors introduce the fundamental concepts of KM theory and practice and mainly discuss the aspects of knowledge management that are valuable to software development organizations and how a KM system for such an organization can be implemented. In addition to knowledge management, information retrieval (IR) also plays a crucial role in SE. IR is a study of how to efficiently and effectively retrieve a required piece of information from a large corpus of storage entities such as documents. As software development organizations grow larger and have to deal with larger numbers (probably millions) of documents of various types, IR becomes an essential tool for retrieving any piece of information that a software developer wants within a short time. IR can be used both as a general-purpose tool to improve the productivity of developers or as an enabler tool to facilitate a KM system.
Thesis
The task of exploring unexploited but newly digitized resources in order to find relevant information is made more complex by the quantity of resources available. Thanks to the ANR CIRESFI project, the most important resource for the eighteenth-century Comédie-Italienne is a set of account registers comprising 28,000 pages. Information extraction is a long and complex process that requires expertise at every step: detection and segmentation, feature extraction, and handwriting recognition. Systems based on deep neural networks dominate all of these approaches. Their major problem is that they require a large quantity of data for training. However, the Comédie-Italienne registers have no ground truth. To compensate for this lack of data, we explore approaches that can perform transfer learning, that is, using an already labelled, available dataset sharing a minimum of common features with our data to train the systems, and then applying them to our data. All of our experiments showed the difficulty of this task, since each choice at each step has a strong impact on the rest of the system. We converge towards a solution that separates the optical model from the language model so that they can be trained independently on the different types of available resources and then combined by projecting all the information into a common non-latent space.
Chapter
Multi-dimensional data indexing has received much research attention recently in a centralized system. However, it remains a nascent area of research in providing an integrated structure for multiple queries on multi-dimensional data in a distributed environment. We propose a new data structure, called BR-tree (Bloom filter based R-tree), and implement such a prototype in the context of a distributed system. The node in a BR-tree, viewed as an expansion from the traditional R-tree node structure, incorporates space-efficient Bloom filters to facilitate fast membership queries. The proposed BR-tree can simultaneously support not only existing point and range queries but also cover and bound queries that can potentially benefit various data indexing services. Compared with previous data structures, BR-tree achieves space efficiency and provides quick response (≤ O(log n)) on these four types of queries. Our extensive experiments in a distributed environment further validate the practicality and efficiency of the proposed BR-tree structure (© 2009 IEEE. Reprinted, with permission, from Ref. [1].).
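The idea of attaching a Bloom filter to a tree node so that membership queries can be answered without descending further can be sketched as follows; the filter size, hash scheme, and node layout are simplified stand-ins rather than the BR-tree's actual design.

```python
import hashlib

# Sketch: a space-efficient Bloom filter attached to a tree node answers
# "does this node's subtree possibly contain the key?" membership queries.
class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

class TreeNode:
    def __init__(self, keys):
        self.keys = keys
        self.bloom = BloomFilter()
        for k in keys:
            self.bloom.add(k)

node = TreeNode(["obj-17", "obj-42", "obj-99"])
print(node.bloom.might_contain("obj-42"))   # True
print(node.bloom.might_contain("obj-1"))    # False (with high probability)
```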
Article
A spatial-temporal data structure, called the PMD-tree (Persistent Multi-Dimensional tree), has been proposed for managing the live intervals and locations of spatial objects. In this paper, the novel concepts of the time space bounding box (TSBB) and the motion list are introduced into the PMD-tree to manage moving spatial objects efficiently. A TSBB is an extended bounding box for a moving object that covers the trajectory of the object. As an object moves, the TSBB corresponding to the object is enlarged to enclose the trajectory of the object. A TSBB is divided when it becomes greater than a limit. An object and its corresponding TSBBs are managed by a doubly connected linked list, called the motion list. TSBBs are also managed by the PMD-tree. By introducing the concepts of the TSBB and the motion list into the PMD-tree, moving objects can be managed efficiently and found quickly for spatial-temporal queries. Through a series of simulation tests, the storage requirements and search performance are evaluated for several types of moving objects. As a result, the proposed method is shown to be superior to the conventional methods.
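A rough sketch of the TSBB and motion-list idea, with an invented extent threshold: each box is enlarged as the object moves, and a new box is appended to the list once the current one grows beyond the limit.

```python
# Illustrative time-space bounding box (TSBB): it is enlarged to enclose a
# moving object's trajectory and a new box is started once its spatial
# extent exceeds a limit, with the boxes chained in a "motion list".
LIMIT = 5.0   # maximum allowed spatial extent of one box (arbitrary)

class TSBB:
    def __init__(self, t, x, y):
        self.t0 = self.t1 = t
        self.xmin = self.xmax = x
        self.ymin = self.ymax = y

    def extent(self):
        return max(self.xmax - self.xmin, self.ymax - self.ymin)

    def enlarge(self, t, x, y):
        self.t1 = t
        self.xmin, self.xmax = min(self.xmin, x), max(self.xmax, x)
        self.ymin, self.ymax = min(self.ymin, y), max(self.ymax, y)

def index_trajectory(samples):
    motion_list = [TSBB(*samples[0])]
    for t, x, y in samples[1:]:
        box = motion_list[-1]
        box.enlarge(t, x, y)
        if box.extent() > LIMIT:          # box grew too big: start a new one
            motion_list.append(TSBB(t, x, y))
    return motion_list

track = [(0, 0, 0), (1, 2, 1), (2, 4, 3), (3, 7, 4), (4, 8, 6), (5, 9, 9)]
boxes = index_trajectory(track)
print(len(boxes), [(b.t0, b.t1) for b in boxes])
```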
Conference Paper
Owing to rapid economic and social development in China, tens of thousands of construction projects are launched each year. During the construction and operation periods, projects will, to a greater or lesser extent, damage or impact the environment of the project site and the surrounding area. In China, project-level environmental impact assessment (PEIA) has been carried out for more than ten years under a state law in order to prevent or mitigate the environmental impact of projects. PEIA is data-driven, or data-intensive, research work. On the one hand, it needs the support of large amounts of data. On the other hand, PEIA produces abundant data and documents. It is therefore urgent to build a PEIA database to standardize, integrate, and preserve these data in the long term, and thus to promote their wide sharing and usage. For successful PEIA database building, database design is a crucial issue that decides what data elements and attributes should be included and thus influences the range of application of the database. With traditional database design measures, it is hard to identify all domain entities or objects and to represent the complex semantic relationships behind them. In this paper, we propose an ontology-based design method for the PEIA database and a method for mapping the ontology to the Entity-Relationship (ER) model. Based on the ontology, all concepts of PEIA and their attributes, as well as their semantic relationships, can be clearly and completely designed and transformed into the ER model. By this method, we have built the National PEIA database of China (NPEIA), which has integrated large amounts of basic supporting datasets, such as geo-spatial data and environmental sensitive-area data, and more than 100,000 PEIA records covering projects from 13 different industries. With the development of big data mining and linked open data, in the future we will focus on enriching and opening the PEIA database and linking it with SEA (strategic environmental assessment) and PEA (planning environmental assessment). Deep data mining and analysis based on NPEIA and other related datasets are also our research emphasis, to support national industrial structural adjustment and total-amount control of pollution.
Chapter
Object-oriented and multimedia databases, which must manipulate multi-valued attributes and complex objects in general, raise problems with particular properties and needs, such as the constrained domain of set-valued attributes (in contrast to the unconstrained domain of text databases) and the superset, perfect-match and subset queries that are frequently asked. The focus of this chapter is on the advantages that can arise from a hybrid structure that handles set-valued attributes in the form of signatures, based on hashing and the S-tree.
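As a rough illustration of the signature approach such structures build on, the sketch below uses superimposed coding: each element sets a few bits, a set's signature is the bitwise OR of its elements' signatures, and bitwise tests act as quick (false-positive-prone) filters for subset, superset and perfect-match queries, to be verified afterwards against the actual sets. All names and parameters (SIG_BITS, BITS_PER_ELEMENT) are invented for this example.

```python
import hashlib

SIG_BITS = 64            # signature length in bits
BITS_PER_ELEMENT = 3     # bits set per element (superimposed coding)

def element_signature(element):
    sig = 0
    for i in range(BITS_PER_ELEMENT):
        digest = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
        sig |= 1 << (int(digest, 16) % SIG_BITS)
    return sig

def set_signature(values):
    sig = 0
    for value in values:
        sig |= element_signature(value)   # superimpose element signatures
    return sig

def maybe_subset(query_sig, stored_sig):
    """Filter for subset queries: the query set may be contained in the stored set."""
    return query_sig & stored_sig == query_sig

def maybe_superset(query_sig, stored_sig):
    """Filter for superset queries: the query set may contain the stored set."""
    return query_sig & stored_sig == stored_sig

def maybe_equal(query_sig, stored_sig):
    """Filter for perfect-match queries."""
    return query_sig == stored_sig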
Chapter
Sections: Introduction; Data Model and System Overview; Replication and Load Balancing; Evaluation; Related Work; Summary; References.
Article
The volume of data in the world has grown enormously, and with the advance of data-intensive applications there is a need to collect, analyze, process, and retrieve enormous datasets efficiently. Such large datasets are popularly termed "Big Data," a term coined by Roger Magoulas, director of market research at O'Reilly Media. Data scientists around the world have developed a range of approaches to deal with these datasets, and scalable implementations of information retrieval (IR) operations have become a necessity. The MapReduce [1, 2] programming model (and Apache Hadoop [3], an open-source implementation of MapReduce) has emerged as a very effective tool for handling large volumes of data in a distributed environment. In this work we extend the single-pass indexing technique for large data with a hash-based implementation over the MapReduce framework.
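The following toy, single-machine sketch shows the general shape of inverted-index construction in the MapReduce style referred to above: map() emits (term, document id) pairs and reduce() merges the pairs for each term into a posting list. Function names and the in-memory "shuffle" are illustrative stand-ins; a real deployment would run these as Hadoop map and reduce tasks.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # emit one (term, doc_id) pair per token
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(term, doc_ids):
    # posting list: sorted, de-duplicated document ids for one term
    return term, sorted(set(doc_ids))

def build_index(documents):
    grouped = defaultdict(list)           # stands in for the shuffle/sort step
    for doc_id, text in documents.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)
    return dict(reduce_phase(t, ids) for t, ids in grouped.items())

if __name__ == "__main__":
    docs = {1: "big data needs scalable indexing",
            2: "indexing big datasets with mapreduce"}
    print(build_index(docs))   # e.g. {'big': [1, 2], 'indexing': [1, 2], ...}
```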
Article
In this book chapter, the authors discuss two important trends in modern software engineering (SE): the use of knowledge management (KM) and of information retrieval (IR). Software engineering is a discipline in which knowledge and experience, acquired over many years, play a fundamental role. For software development organizations, the main assets are not manufacturing plants, buildings, and machines, but the knowledge held by their employees. Software engineering has long recognized the need for managing knowledge, and the SE community could learn much from the KM community. The authors introduce the fundamental concepts of KM theory and practice, discussing in particular the aspects of knowledge management that are valuable to software development organizations and how a KM system for such an organization can be implemented. In addition to knowledge management, information retrieval also plays a crucial role in SE. IR is the study of how to efficiently and effectively retrieve a required piece of information from a large corpus of storage entities such as documents. As software development organizations grow larger and have to deal with large numbers (possibly millions) of documents of various types, IR becomes an essential tool for retrieving any piece of information a software developer needs within a short time. IR can be used either as a general-purpose tool to improve developer productivity or as an enabler of a KM system.
Article
In this work we developed FindMeDIA, whose purpose is to preserve Moroccan cultural heritage in the form of movies and photographs. This required managing a joint image and video database. Since a video can be viewed as a sequence of still images, we treat video as an extension of image modelling in a quasi-transparent way. Our main objective was, on the one hand, to meet genericity and flexibility requirements so that users can navigate over different types of visual data and, on the other hand, to provide a system in which users can move seamlessly between images and videos. For video modelling, we propose FindViDEO; in both its model and metamodel parts, FindViDEO is flexible and covers a large spectrum of applications and pre-existing models. For navigation, we reuse a Galois lattice technique on a database composed of still images and key frames extracted from videos. The resulting FindMeDIA system is generic and can use many image description techniques for navigation. To assess the interest of these approaches, the modelling of key frames extracted from videos, as well as of still images, is carried out with ClickImAGE, which provides a semi-structured, content-based representation of images.
Article
Recent trends in information technology have resulted in a massive proliferation of data that is carried over different kinds of networks, produced in either on-demand or streaming fashion, generated and accessed by a variety of devices, and possibly subject to mobility. This thesis presents an approach for evaluating hybrid queries that integrate the various aspects involved in querying continuous, mobile and hidden data in dynamic environments. Our approach represents such a hybrid query as a service coordination comprising data and computation services. A service coordination is specified by a query workflow and additional operator workflows. A query workflow represents an expression built with the operators of our data model; it is constructed from a query written in our proposed SQL-like language, HSQL, by an algorithm we developed based on known results from database theory. Operator workflows compose computation services in order to evaluate a particular operator. HYPATIA, a service-based hybrid query processor, implements and validates our approach.
Conference Paper
Indexing is important for fast query response in information retrieval, and supporting multiple query types over multidimensional data is a challenging task that has recently received much attention. In this paper a new data structure, the Perfect Hash Base R-tree (PHR-tree), is proposed. A PHR-tree node extends the traditional R-tree node with a perfect hashing index so that multiple query types can be supported efficiently. The structure supports point queries on multidimensional data efficiently, and it provides space efficiency and fast response (O(log n)) for all query types.
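A simplified sketch of the idea, with invented names: each node of an R-tree-like structure keeps, alongside its entries, a hash index over the points stored in its subtree, so that exact point queries can be answered without descending through overlapping rectangles. A Python dict stands in for the perfect hash index described in the paper, and rectangles and range queries are omitted.

```python
class PHRTreeNode:
    def __init__(self, children=None, points=None):
        self.children = children or []          # internal node: child nodes
        self.points = points or []              # leaf node: (x, y) tuples
        # hash index over every point stored in this subtree
        self.hash_index = {p: True for p in self.points}
        for child in self.children:
            self.hash_index.update(child.hash_index)

def point_query(node, point):
    # constant-time membership test at the node instead of an R-tree descent
    return point in node.hash_index

leaf_a = PHRTreeNode(points=[(1, 2), (3, 4)])
leaf_b = PHRTreeNode(points=[(5, 6)])
root = PHRTreeNode(children=[leaf_a, leaf_b])
assert point_query(root, (3, 4)) and not point_query(root, (7, 8))
```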
Article
We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, generalizing classical and recent worst-case algorithmic results on computing joins. In addition, we use the framework and the same algorithm to show a series of what are colloquially known as beyond-worst-case results. The framework allows us to prove results for data stored in B-trees, in multidimensional data structures, and even in multiple indices per table. A key idea is to formalize the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins into a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.
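To make the "backtracking search with memoization" view concrete, here is a small, generic sketch (an illustration, not the paper's algorithm): attributes are bound one at a time, a partial binding is kept only while every relation still has a matching tuple, and failed subproblems are memoized on the part of the binding that can still influence the remaining search. All function names are invented.

```python
def natural_join(relations, schemas, attr_order):
    """relations: list of tuple lists; schemas: parallel lists of attribute names."""
    results, failed = [], set()

    def compatible(binding):
        # every relation must still contain a tuple agreeing with the binding
        return all(
            any(all(binding.get(a, v) == v for a, v in zip(schema, t)) for t in rel)
            for rel, schema in zip(relations, schemas)
        )

    def memo_key(i, binding):
        # only attributes co-occurring with a still-unbound attribute matter
        remaining = set(attr_order[i:])
        relevant = {a for schema in schemas if remaining & set(schema)
                    for a in schema if a in binding}
        return i, tuple(sorted((a, binding[a]) for a in relevant))

    def search(i, binding):
        if i == len(attr_order):
            results.append(dict(binding))
            return
        key = memo_key(i, binding)
        if key in failed:
            return                                    # memoized dead end
        produced = len(results)
        attr = attr_order[i]
        candidates = {t[schema.index(attr)]
                      for rel, schema in zip(relations, schemas) if attr in schema
                      for t in rel}
        for value in candidates:
            binding[attr] = value
            if compatible(binding):
                search(i + 1, binding)
            del binding[attr]
        if len(results) == produced:
            failed.add(key)                           # remember the failure

    search(0, {})
    return results

R = [(1, "a"), (2, "b")]                              # R(x, y)
S = [("a", 10), ("a", 20)]                            # S(y, z)
print(natural_join([R, S], [["x", "y"], ["y", "z"]], ["x", "y", "z"]))
```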
Article
Determining fracture network connections helps investigators remove "meaningless" fractures that pass no flow, yielding an updated and more effective fracture network that can considerably improve computational efficiency in numerical simulations of fluid flow and solute transport. Efficient algorithms are needed to accomplish this task in large-scale fractured rock masses. We propose a new approach based on R-tree indexing for determining fracture connections in three-dimensional, stochastically distributed fracture networks. Compared with the traditional exhaustion algorithm, the simulation results show that this approach is much more effective, and its advantage grows as more fractures are investigated. The results also indicate that creating the R-tree index accounts for a major part of the total runtime, which comprises four steps: calculating minimum bounding rectangles (MBRs), creating the R-tree index, finding fracture intersections precisely, and identifying flow paths. The proposed approach for determining fracture connections in three-dimensional fractured rocks is expected to provide efficient preprocessing and a critical database for numerical computation of fluid flow and solute transport in large-scale fractured rock masses.
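The sketch below condenses the filter-and-refine pattern behind such an approach: each fracture is reduced to its minimum bounding rectangle, only pairs with overlapping MBRs are passed to the exact (and expensive) intersection test, and in the paper an R-tree over the MBRs is what finds those overlapping pairs without testing every pair as the exhaustion algorithm does. All names and the 2D simplification are illustrative.

```python
def mbr(points):
    """Minimum bounding rectangle of a fracture given as a list of (x, y) points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_pairs(fractures):
    boxes = [mbr(f) for f in fractures]
    # an R-tree built over `boxes` would answer this overlap search without
    # the all-pairs loop used here for brevity
    return [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
            if mbrs_overlap(boxes[i], boxes[j])]

def connected_pairs(fractures, exact_intersection_test):
    """Refine the MBR candidates with the exact geometric intersection test."""
    return [(i, j) for i, j in candidate_pairs(fractures)
            if exact_intersection_test(fractures[i], fractures[j])]
```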
Conference Paper
Different models have been proposed recently for representing temporal data, tracking historical information, and retrieving the results of temporal queries efficiently. We consider the problem of indexing temporal XML documents. In particular, we propose an indexing scheme that uses a summary structure and a matrix capturing the structural relationships as well as the time intervals inside a temporal XML document. We introduce an algorithm that uses the proposed index to efficiently process all types of temporal queries at any depth. We show that our index outperforms state-of-the-art indices in both query processing time and support for different temporal query types.
Article
A major performance goal of a DBMS is to minimize the number of I/Os (i.e., blocks or pages transferred) between the disk and main memory. One way to achieve this goal is to minimize the number of I/Os needed to answer a query. Many queries reference only a small portion of the records in a database file. For example, the query "find the employees who reside in Santa Monica" references only a fraction of the records in the Employee relation. It would be very inefficient to have the database system sequentially read all the pages of the Employee file and check the residence field of each employee record for the name "Santa Monica"; instead, the system should be able to locate the pages with "Santa Monica" employee records directly. To allow such fast access, additional data structures called access methods are designed per database file. There are two fundamental access methods, namely indexing and hashing. The most widely used indexing scheme is the B+-tree. Hashing is also common, particularly in its Extendible and Linear Hashing variants. We also describe two multi-attribute access methods, the k-d tree and the Grid File. Finally, we discuss an approach that is popular for document searching, the Inverted File.
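A toy illustration of the access-method idea in the "Santa Monica" example: a hash-based secondary index on the residence attribute maps each city to the record identifiers of the matching employees, so the query touches only those records instead of scanning the whole file. The data and names are invented for this sketch.

```python
from collections import defaultdict

employees = {                              # record id -> (name, residence)
    1: ("Alice", "Santa Monica"),
    2: ("Bob", "Pasadena"),
    3: ("Carol", "Santa Monica"),
}

residence_index = defaultdict(list)        # secondary index on the residence field
for rid, (_, city) in employees.items():
    residence_index[city].append(rid)

def employees_in(city):
    # fetch only the records whose ids appear in the index entry for `city`
    return [employees[rid] for rid in residence_index.get(city, [])]

print(employees_in("Santa Monica"))        # reads only the two matching records
```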
Article
LH*RS is a high-availability scalable distributed data structure (SDDS). An LH*RS file is hash-partitioned over the distributed RAM of a multicomputer, for example a network of PCs, and tolerates the unavailability of any k ≥ 1 of its server nodes. The value of k transparently grows with the file to offset the decline in reliability. Only the number of storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure-correcting coding. The resulting parity storage overhead is about the lowest possible, and the parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives for data-intensive applications, including the emerging ones of grid and P2P computing.
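The following is a deliberately simplified, single-parity sketch of the encode-and-recover pattern behind such parity schemes: with one parity bucket, the XOR of the data buckets is enough to rebuild any one unavailable bucket. LH*RS itself uses Reed-Solomon erasure coding so that any k ≥ 1 failures can be tolerated; this k = 1 illustration, with invented names, only shows the pattern.

```python
def xor_bytes(blocks):
    """Bytewise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode_group(data_buckets):
    """Return the parity bucket for a group of equally sized data buckets."""
    return xor_bytes(data_buckets)

def recover_bucket(surviving_buckets, parity):
    """Rebuild the single missing data bucket from the survivors and the parity."""
    return xor_bytes(surviving_buckets + [parity])

group = [b"record-a", b"record-b", b"record-c"]
parity = encode_group(group)
assert recover_bucket([group[0], group[2]], parity) == group[1]
```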
Article
With the rapid rise in site-specific data collection, many research efforts have been directed towards finding optimal sampling and analysis procedures. However, the absence of widely available, high-quality precision agriculture data sets makes it difficult to compare results from separate experiments and to assess the optimality and applicability of procedures. To provide a tool for spatial data experimentation, we have developed a spatial data generator that allows users to produce data layers with given spatial properties and a response variable (e.g. crop yield) dependent upon user-specified functions. Differences in response functions within fields can be simulated by assigning different models to regions in coordinate space (x and y) or in feature space (the multidimensional space of attributes that may influence the response). Noise, whether unexplained variance or sensor error, can be added to all spatial layers. Sampling and interpolation error is modeled by sampling a continuous data layer and interpolating values at unsampled locations. The program has been successfully tested for up to 15,000 grid points, 10 features and 5 models. As an illustration of the potential uses of generated data, the effect of sampling density and kriging interpolation on neural network prediction of crop yield was assessed. Yield prediction accuracy was highly related (correlation coefficient 0.98) to the accuracy of the interpolated layers, indicating that unless data are sampled at very high densities relative to their geostatistical properties, one should not attempt to build highly accurate regression functions from interpolated data. By allowing users to generate large amounts of data with controlled complexity and features, the spatial data generator should facilitate the development of improved sampling and analysis procedures for spatial data.
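A minimal sketch, with invented feature names and model coefficients, of the kind of layer generation described: grid coordinates, two feature layers, a response computed from a region-dependent model, and additive Gaussian noise representing unexplained variance.

```python
import math
import random

def generate_field(nx=50, ny=50, noise_sd=0.2, seed=42):
    rng = random.Random(seed)
    rows = []
    for i in range(nx):
        for j in range(ny):
            x, y = i / nx, j / ny
            nitrogen = 0.5 + 0.5 * math.sin(6.0 * x)        # feature layer 1
            moisture = 0.5 + 0.5 * math.cos(4.0 * y)        # feature layer 2
            # different response model in the western and eastern halves of the field
            if x < 0.5:
                response = 2.0 * nitrogen + 1.0 * moisture
            else:
                response = 0.5 * nitrogen + 2.5 * moisture
            response += rng.gauss(0.0, noise_sd)            # unexplained variance
            rows.append((x, y, nitrogen, moisture, response))
    return rows

field = generate_field()
print(len(field), field[0])     # 2500 grid points; one (x, y, features, yield) row
```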
Article
Providing efficient query processing in database systems is one step towards gaining acceptance of such systems by end users. We propose several techniques for indexing fuzzy sets in databases to improve the query evaluation performance. Three of the presented access methods are based on superimposed coding, while the fourth relies on inverted files. The efficiency of these techniques was evaluated experimentally. We present results from these experiments, which clearly show the superiority of the inverted files.
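As an illustration of the inverted-file option (not the paper's implementation), the sketch below maps each domain element to the records whose fuzzy set contains it, together with the membership grade, so that a query such as "records where 'red' has grade at least 0.7" reads one posting list instead of scanning all records. Names and data are invented.

```python
from collections import defaultdict

def build_inverted_file(records):
    """records: dict of record id -> fuzzy set given as {element: membership grade}."""
    postings = defaultdict(list)
    for rid, fuzzy_set in records.items():
        for element, grade in fuzzy_set.items():
            postings[element].append((rid, grade))
    return postings

def query(postings, element, min_grade):
    # return ids of records whose fuzzy set contains `element` with grade >= min_grade
    return [rid for rid, grade in postings.get(element, []) if grade >= min_grade]

records = {1: {"red": 0.9, "blue": 0.2}, 2: {"red": 0.6}, 3: {"green": 1.0}}
inv = build_inverted_file(records)
print(query(inv, "red", 0.7))      # -> [1]
```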