Book

Indexing Techniques for Advanced Database Systems


Abstract

Recent years have seen an explosive growth in the use of new database applications such as CAD/CAM systems, spatial information systems, and multimedia information systems. The needs of these applications are far more complex than those of traditional business applications. They call for support of objects with complex data types, such as images and spatial objects, and for support of objects with widely varying numbers of index terms, such as documents. Traditional indexing techniques such as the B-tree and its variants do not efficiently support these applications, and so new indexing mechanisms have been developed. As a result of the demand for database support for new applications, there has been a proliferation of new indexing techniques. The need for a book addressing indexing problems in advanced applications is evident. For practitioners and database and application developers, this book explains best practice, guiding the selection of appropriate indexes for each application. For researchers, this book provides a foundation for the development of new and more robust indexes. For newcomers, this book is an overview of the wide range of advanced indexing techniques. Indexing Techniques for Advanced Database Systems is suitable as a secondary text for a graduate-level course on indexing techniques, and as a reference for researchers and practitioners in industry.

Chapters (6)

There has been a growing acceptance of the object-oriented data model as the basis of next-generation database management systems (DBMSs). Both pure object-oriented DBMSs (OODBMSs) and object-relational DBMSs (ORDBMSs) have been developed based on object-oriented concepts. Object-relational DBMSs, in particular, extend the SQL language by incorporating all the concepts of the object-oriented data model. A large number of products in both categories of DBMS are available today. In particular, all major vendors of relational DBMSs are turning their products into ORDBMSs [Nori, 1996].
Many applications (such as computer-aided design (CAD), geographic information systems (GIS), computational geometry and computer vision) operate on spatial data. Generally speaking, spatial data are associated with spatial coordinates and extents, and include points, lines, polygons and volumetric objects.
Images have always been an essential and effective medium for presenting visual data. With advances in today’s computer technologies, it is not surprising that in many applications, much of the data is images. In medical applications, images such as X-rays, magnetic resonance images and computer tomography images are frequently generated and used to support clinical decision making. In geographic information systems, maps, satellite images, demographics and even tourist information are often processed, analyzed and archived. In police department criminal databases, images like fingerprints and pictures of criminals are kept to facilitate identification of suspects. Even in offices, information may arrive in many different forms (memos, documents, and faxes) that can be digitized electronically and stored as images.
Apart from some primary keys and keys that rarely change, many attributes evolve and take new values over time. For example, in an employee relation, employees' titles may change as they take on new responsibilities, as will their salaries as a result of promotions or increments. Traditionally, when data is updated, its old copy is discarded and only the most recent version is captured. Conventional databases that have been designed to capture only the most recent data are known as snapshot databases. With the increasing awareness of the value of the history of data, maintenance of old versions of records becomes an important feature of database systems.
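To make the snapshot-versus-history distinction concrete, here is a minimal, hypothetical Python sketch of a relation that keeps every old version of a record with its validity interval instead of overwriting it; the class and attribute names are invented for illustration and do not correspond to any particular system described in the book.

```python
from datetime import date

# Minimal sketch of a "temporal" employee relation: instead of overwriting a
# record (snapshot behaviour), each update closes the current version and
# appends a new one with its own valid-time interval.
MAX_DATE = date(9999, 12, 31)

class VersionedRelation:
    def __init__(self):
        self.rows = []  # (key, value, valid_from, valid_to)

    def update(self, key, value, as_of):
        # Close the currently open version of this key, if any.
        for i, (k, v, start, end) in enumerate(self.rows):
            if k == key and end == MAX_DATE:
                self.rows[i] = (k, v, start, as_of)
        # Append the new version, open-ended until the next change.
        self.rows.append((key, value, as_of, MAX_DATE))

    def value_at(self, key, when):
        # Return the value that was valid on the given date.
        for k, v, start, end in self.rows:
            if k == key and start <= when < end:
                return v
        return None

emp = VersionedRelation()
emp.update("alice", ("Engineer", 50000), date(2018, 1, 1))
emp.update("alice", ("Senior Engineer", 65000), date(2021, 6, 1))
print(emp.value_at("alice", date(2020, 3, 1)))   # ('Engineer', 50000)
print(emp.value_at("alice", date(2022, 1, 1)))   # ('Senior Engineer', 65000)
```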
Text databases provide rapid access to collections of digital documents. Such databases have become ubiquitous: text search engines underlie the online text repositories accessible via the Web and are central to digital libraries and online corporate document management.
Because performance is a crucial issue in database systems, indexing techniques have always been an area of intense research and development. Advances in indexing techniques are primarily driven from the need to support different data models, such as the object-oriented data model, and different data types, such as image and text data. However, advances in computer architectures may also require significant extensions to traditional indexing techniques. Such extensions are required to fully exploit the performance potential of new architectures, such as in the case of parallel architectures, or to cope with limited computing resources, such as in the case of mobile computing systems. New application areas also play an important role in dictating extensions to indexing techniques and in offering wider contexts in which traditional techniques can be used.
... The classical spatial approaches in indexing often tend to linearise the data so as to use known "fast" structures. Such is the case for quadtrees, kd-trees (Samet, 1984; Ooi, 1997) or other methods for spatial objects, or more accurately spatial points. Kd-trees are related to binary trees. ...
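As a rough illustration of the kd-tree idea mentioned in this excerpt (alternating one-dimensional splits over spatial points, closely related to binary search trees), the following Python sketch builds a small 2-d kd-tree and answers a window query; it is a didactic toy, not the cited authors' implementation.

```python
# Minimal 2-d kd-tree sketch (illustrative only): points are split on
# alternating coordinates, which is what makes the structure a close
# relative of the binary search tree.
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % 2                      # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median keeps the tree balanced
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def range_search(node, lo, hi, out):
    # Collect all points inside the axis-aligned rectangle [lo, hi].
    if node is None:
        return
    x, a = node.point, node.axis
    if all(lo[i] <= x[i] <= hi[i] for i in range(2)):
        out.append(x)
    if lo[a] <= x[a]:
        range_search(node.left, lo, hi, out)
    if x[a] <= hi[a]:
        range_search(node.right, lo, hi, out)

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = []
range_search(tree, (3, 1), (8, 5), hits)
print(hits)   # points falling inside the query window
```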
... The bounding rectangles can then be regrouped within bigger rectangles so as to create a balanced tree. The R-tree and its sibling the R*-tree (Ooi, 1997) are examples of this. While R-trees allow working with complex objects (approximated as rectangles rather than points), their higher building and querying times make the use of lighter structures appealing for indexing points. ...
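The grouping of object bounding rectangles under larger parent rectangles can be sketched as follows; the naive sort-by-x grouping below merely illustrates the principle and is not the actual R-tree or R*-tree insertion algorithm.

```python
# Sketch of the idea behind R-tree nodes: object MBRs (minimum bounding
# rectangles) are grouped, and each group is covered by a bigger rectangle
# stored one level up.
def mbr(rects):
    # Smallest rectangle enclosing all given (xmin, ymin, xmax, ymax) rects.
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def group_rects(rects, fanout=3):
    rects = sorted(rects, key=lambda r: r[0])        # naive spatial ordering
    groups = [rects[i:i + fanout] for i in range(0, len(rects), fanout)]
    return [(mbr(g), g) for g in groups]             # (parent MBR, children)

objects = [(0, 0, 1, 1), (2, 1, 3, 4), (1, 1, 2, 2),
           (8, 8, 9, 9), (7, 6, 8, 7), (9, 5, 10, 6)]
for parent, children in group_rects(objects):
    print(parent, "covers", children)
```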
... Regarding the temporal approach, it is important to note that different notions of time can be used in databases (Ooi, 1997). Transaction time allows users to perform "rollbacks" so as to find past values. ...
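A minimal, assumed-for-illustration sketch of transaction time: writes are appended with commit timestamps and never overwritten, so the state can be rolled back to any earlier moment with an "as of" reconstruction.

```python
import time

# Illustrative transaction-time store: every write is appended with its
# commit timestamp and nothing is overwritten, so an "as of" query can
# roll the table back to any earlier point.
class TransactionTimeStore:
    def __init__(self):
        self.log = []                      # (commit_ts, key, value)

    def put(self, key, value, commit_ts=None):
        self.log.append((commit_ts if commit_ts is not None else time.time(),
                         key, value))

    def as_of(self, ts):
        # Reconstruct the state the database had at transaction time ts.
        state = {}
        for commit_ts, key, value in self.log:
            if commit_ts <= ts:
                state[key] = value
        return state

db = TransactionTimeStore()
db.put("salary:alice", 50000, commit_ts=100)
db.put("salary:alice", 65000, commit_ts=200)
print(db.as_of(150))   # {'salary:alice': 50000} -- the rolled-back value
print(db.as_of(250))   # {'salary:alice': 65000}
```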
... As modern databases increasingly integrate various types of information, such as multimedia data, it becomes necessary to support efficient retrieval in such systems. Examples of such applications include multimedia information systems [9], CAD/CAM [4], geographical information systems (GIS) [7], time-series databases [10], and medical imaging [8]. The data is usually represented by a feature vector which summarizes the original data with some number of dimensions. The representation of multidimensional point data is a central issue in database design, as well as in applications in many other fields, including computer graphics, computer vision, computational geometry, image processing, geographic information systems (GIS), pattern recognition, very large scale integration (VLSI) design, and others. ...
... A very popular and effective technique employed to overcome the curse of dimensionality is the Vector Approximation File (VA-File) [4]. In the VA-File, the space is partitioned into a number of hyper-rectangular cells, which approximate the data that reside inside the cells. ...
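The following Python toy (with invented resolutions and data) illustrates the VA-File principle described here: quantising each dimension into a few bits yields a cell approximation per point, and a lower-bound distance to each cell is used to filter candidates before the exact vectors are examined.

```python
# Rough sketch of the VA-File idea: each dimension is quantised into a few
# bits, so every point gets a compact cell approximation that is scanned
# first to filter candidates before touching the exact vectors.
def approximate(point, bits=2, lo=0.0, hi=1.0):
    cells = 2 ** bits
    return tuple(min(cells - 1, int((x - lo) / (hi - lo) * cells)) for x in point)

def cell_bounds(cell, bits=2, lo=0.0, hi=1.0):
    width = (hi - lo) / (2 ** bits)
    return [(lo + c * width, lo + (c + 1) * width) for c in cell]

def min_dist_to_cell(query, cell, bits=2):
    # Lower bound of the distance from the query to any point in the cell.
    d = 0.0
    for q, (cl, ch) in zip(query, cell_bounds(cell, bits)):
        if q < cl:
            d += (cl - q) ** 2
        elif q > ch:
            d += (q - ch) ** 2
    return d ** 0.5

data = [(0.1, 0.2), (0.8, 0.9), (0.45, 0.5)]
va_file = [approximate(p) for p in data]       # compact approximations
query = (0.15, 0.25)
# Filtering step: rank cells by their lower-bound distance to the query.
print(sorted(zip(va_file, data), key=lambda e: min_dist_to_cell(query, e[0])))
```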
... Thus, we again stress that the notion of a cluster is imprecise, and the best definition depends on the type of data and the desired results. Indexes [4] such as R-trees [11] have been shown to be inefficient even for supporting range/window queries in high-dimensional databases; they do, however, form the basis for indexes designed for high-dimensional databases [12,17]. To reduce the effect of high dimensionality, the use of bigger nodes [3], dimensionality reduction [7] and filter-and-refine methods [6] have been proposed. ...
Article
We consider approaches for similarity search in correlated, high-dimensional data-sets, which are derived within a clustering framework. We note that indexing by "vector approximation" (VA-File), which was proposed as a technique to combat the "curse of dimensionality", employs scalar quantization and hence necessarily ignores dependencies across dimensions, which represents a source of suboptimality. Clustering, on the other hand, exploits inter-dimensional correlations and is thus a more compact representation of the data-set. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. Recently, a vector approximation based technique called the VA-File has been proposed for indexing high-dimensional data. It has been shown that the VA-File is an effective technique compared to the current approaches based on space and data partitioning, and it gives good performance especially when the data set is uniformly distributed. Real data sets, however, are not uniformly distributed, are often clustered, and the dimensions of their feature vectors are usually correlated. Compared with the VA-File, over a wide range of quantization resolutions, the clustering-based approach is able to reduce random IO accesses, given (roughly) the same amount of sequential IO operations, by factors reaching 100X and more.
... We then obtain a hierarchy. Among these trees, one can cite the R-tree (Guttman, 1984), the R* and R+-trees, the BV-tree (Bertino, 1997), the DR-tree (Lee, 2001), the Pyramid-tree, the SS and SR-trees, the TV-tree, the X-tree (Berchtold, 1996), the XBR-tree (Vassilakopoulos, 1999), etc. ...
... The architectures discussed above are not the only existing ones. Indeed, one can also cite dynamic hashing methods (Grid File, Plop Hashing, etc.), as well as methods based on quadtrees, or on octrees for three-dimensional space (Bertino, 1997). ...
... Within this last category, one can cite the Append-Only tree, the B+-tree with linear order, the Interval B-tree, the Multi-Version B-tree, the NST-tree, the Time Index, and the TSB-tree (Bertino, 1997). We focus more particularly on the Append-Only tree. ...
Article
Full-text available
Sensor database systems are increasingly used for monitoring hazardous environments. These systems usually consist of an array of sensors sending their measurements to a central database. The measurement frequency as well as the user requirements impose soft real-time constraints on the database and tend to put the focus on the newest data. The widespread need to access the data through spatial criteria adds spatio-temporal specificities. To meet these requirements, this paper proposes two data access methods. The first one, dedicated to indexing the large number of measurements coming from fixed sensors, is named the PoTree. Its evolution, the PasTree, focuses on managing sensor agility and on extending the querying methods.
... The spatial approaches in indexing often tend to linearize the data so as to use known "fast" structures. Such is the case for quadtrees, kd-trees (Ooi, Tan, 1997) or other methods for spatial objects, or more accurately spatial points. Kd-trees are related to binary trees. ...
... The bounding rectangles can then be regrouped within bigger rectangles so as to create a balanced tree. The R-tree and its sibling the R*-tree (Ooi, Tan, 1997) are examples of this. While R-trees allow working with complex objects (approximated as rectangles rather than points), their higher building and querying times make the use of lighter structures appealing for indexing points. ...
... For the temporal approach, it is important to note that different notions of time can be used in databases (Ooi, Tan, 1997). Transaction time allows users to perform "rollbacks" so as to find past values. ...
Article
Full-text available
The goal of this paper is to underline the importance of real-time systems for managing information during the phase of disaster monitoring. We stress the importance of soft real-time GIS, and we present a list of barriers to overcome in order to get this kind of system. Among the barriers, we present a solution for real-time indexing of spatio-temporal data based on a data structure named PO-Tree.
... Access to an image is obtained through the description that accompanies it. Several authors discuss the limitations of these systems in their work [32, 33, 14, 19, 4, 27, 16]. Currently, work is being done on methodologies that analyze queries expressed in natural language [30] as a way to facilitate searches without the drawbacks mentioned above. ...
... Elisa Bertino and colleagues [4] have proposed an architecture model for an Image Database system. This architecture, shown in Figure 1, reflects the modules and functionality of these systems. ...
... The query module is the one that allows the user to perform searches, using the different types of query methods available. In the model presented by Bertino et al. [4], feature extraction is assumed to take place in the query module, which is why it can be said to be a model oriented toward querying by an example or a sketch of the image. The query process can be carried out in several ways: 1. Using an example image to ask the system for similar images. ...
Article
Full-text available
Abstract: This paper first presents a brief review of image databases. The reasons for their origin and evolution are outlined, grouping them into two categories according to their functionality. The different types of image databases are then examined, briefly commenting on their characteristics and drawing a division between systems considered simple and Content-Based Image Retrieval Systems (CBIRS). Subsequently, an architecture due to Bertino is presented, briefly describing the different modules and the main processes that CBIRS incorporate. Other architectures that have recently appeared in the literature are also briefly discussed. A new architecture is proposed, more general than the previous ones, which accommodates the different query methods used in current CBIRS. The operation of the different modules of this architecture is explained, both in the population phase and in the query phase, together with its main stages and the possible data paths that the information follows depending on the query method employed. Keywords: image databases, image database architectures, Content-Based Image Retrieval Systems, image processing.
... First, we add a Time-span entity specifying the interval of existence between two dates in the real world for each Geospatial Feature to note the feature's valid time (e.g. a building built in 1975 and destroyed in 2012). Spatial Nodes can also be associated with a Time-span to use spatio-temporal indexing methods (Theodoridis et al. 1998, Hadjieleftheriou et al. 2002, Mokbel et al. 2003) or temporal indexing methods (Bertino et al. 1997). While organization of spatio-temporal features using a spatial indexing method is easy to implement, it may not be optimal. ...
... Geospatial web delivery standards allow data to be organized according to spatial indexing methods (Azri et al. 2013). The temporal extension proposed in this paper makes it possible to use spatio-temporal indexing methods (Theodoridis et al. 1998, Hadjieleftheriou et al. 2002, Mokbel et al. 2003) or temporal indexing methods (Bertino et al. 1997). An interesting challenge would then be to study the impact of these indexing methods on querying and visualizing 3D tiles with 3DTiles_temporal extension datasets. ...
Article
Full-text available
Studying and planning urban evolution is essential for understanding the past and designing the cities of the future and can be facilitated by providing means for sharing, visualizing, and navigating in cities, on the web, in space and in time. Standard formats, methods, and tools exist for visualizing large-scale 3D cities on the web. In this paper, we go further by integrating the temporal dimension of cities in geospatial web delivery standard formats. In doing so, we enable interactive visualization of large-scale time-evolving 3D city models on the web. A key characteristic of this paper lies in the proposed four-step generic approach. First, we design a generic conceptual model of standard formats for delivering 3D cities on the web. Then, we formalize and integrate the temporal dimension of cities into this generic conceptual model. Following which, we specify the conceptual model in the 3D Tiles standard at logical and technical specification levels, resulting in an extension of 3D Tiles for delivering time-evolving 3D city models on the web. Finally, we propose an open-source implementation, experiments, and an evaluation of the propositions and visualization rules. We also provide access to reproducibility notes allowing researchers to replicate all the experiments.
... The features of the data models are also compared. The importance of indexing (Bertino et al., 2012) in databases has been well established for a long time, particularly when the data size is huge. Indexing in Big Data is comparatively a challenge due to several reasons. ...
... Indexes support the efficient execution of frequently used queries, especially for read operations (Bertino et al., 2012). An index may include one or more columns, and sometimes the size of an index may grow larger than the table it is created for, but it eventually provides rapid lookup and fast access to the data, and this compensates for the overhead of having indexes. ...
Article
Full-text available
Today, in data science, the term Big Data has attracted a large audience from various research fields and industries who see the potential of Big Data in solving complex problems. In this age, decision-making processes are largely data dependent. Though the concept of Big Data is in the midst of its evolution, with great research and business opportunities, the challenges are enormous and growing equally, from data collection up to decision making. This motivates various scientific disciplines to combine their efforts for deep exploration of all dimensions of Big Data to procure evolutionary outcomes. The considerable velocity of the volume expansion and the variety of the data pose serious challenges to the existing data processing systems. Especially in the last few years, the volume of data has grown manyfold. Data storage has been inundated by various disparate data outlets, led by social media such as Facebook, Twitter, etc. The existing data models are largely unable to illuminate the full potential of Big Data; the information that may serve as the key solution to several complex problems is left unexplored. The existing computation capacity falls short of the increasingly expanded storage capacity. The fast-paced volume expansion of unorganized data entails a complete paradigm shift in new-age data computation and is driving the evolution of new capable data engineering techniques such as capture, curation, visualization, and analysis. In this paper, we provide a first-level classification for modern Big Data models. Some of the leading representatives of each class that claim to best process Big Data in a reliable and efficient way are also discussed. The classification is further strengthened by intra-class and inter-class comparisons and discussions of the Big Data models under consideration.
... In OQL, path expressions are defined as a chain of objects and methods/attributes in the so-called object composition graph [15] (a.k.a. aggregation graph [16]). In the object composition graph, an object o1 has a directed edge to another object o2 if and only if o1 has an attribute or method whose value or result is in the class of object o2. ...
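A small illustrative sketch of such an object composition (aggregation) graph, built from a made-up schema: an edge goes from class c1 to class c2 whenever an attribute of c1 is typed by c2, and a path expression simply follows these edges.

```python
# Toy construction of an object composition (aggregation) graph: class c1
# points to class c2 whenever some attribute of c1 is typed by c2.
# The schema below is invented purely for illustration.
schema = {
    "Department": {"name": "str", "manager": "Employee"},
    "Employee":   {"name": "str", "address": "Address", "dept": "Department"},
    "Address":    {"city": "str", "zip": "str"},
}

def composition_graph(schema):
    edges = set()
    for cls, attrs in schema.items():
        for attr, target in attrs.items():
            if target in schema:           # attribute typed by another class
                edges.add((cls, attr, target))
    return edges

for src, attr, dst in sorted(composition_graph(schema)):
    print(f"{src} --{attr}--> {dst}")
# A path expression such as Department.manager.address.city simply follows
# these edges from class to class.
```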
... Another optimization technique is using an indexed scan on the paths when an index is available. Optimizations based on indexes are beyond the scope of this paper, and we refer the interested readers to [16]. Join-based evaluation of XPath path expressions is possible, and we discuss it along with an approach based on tree pattern matching (TPM) and a hybrid approach combining the two. ...
Article
Path expressions are ubiquitous in XML processing languages such as XPath, XQuery, and XSLT. Expressions in these languages typically include multiple path expressions, some of them correlated. Existing approaches evaluate these path expressions one at a time and miss the optimization opportunities that may be gained by exploiting the correlations among them. In this paper, we address the evaluation and optimization of correlated path expressions. In particular, we propose two types of optimization techniques: integrating correlated path expressions into a single pattern graph, and rewriting the pattern graph according to a set of rewriting rules. The first optimization technique allows the query optimizer to choose an execution plan that is impossible with the existing approaches. The second optimization technique rewrites pattern graphs at a logical level and produces a set of equivalent pattern graphs from which a physical optimizer can choose given an appropriate cost function. Under certain conditions that we identify, the graph pattern matching-based execution approach that we propose may be more efficient than the join-based approaches.
... B+ Tree indexes have been broadly used in data-heavy systems to ease query retrieval. They are widely adopted because the B+ Tree is a height-balanced tree [13]. In more detail, each path from the root of the tree to a leaf of the tree is the same length. ...
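To illustrate the height-balance property, the toy bulk loader below builds a B+-tree-like structure level by level from sorted keys and then checks that every leaf sits at the same depth; it is a simplification (no insertion or node splitting), not a full B+-tree.

```python
# Sketch of why a B+-tree keeps every root-to-leaf path the same length:
# the tree is built level by level (here by bulk-loading sorted keys), so
# all leaves end up at the same depth.
FANOUT = 4

def bulk_load(sorted_keys):
    # Leaf level: chunks of keys.
    level = [{"keys": sorted_keys[i:i + FANOUT], "children": None}
             for i in range(0, len(sorted_keys), FANOUT)]
    # Build internal levels until a single root remains.
    while len(level) > 1:
        level = [{"keys": [c["keys"][0] for c in level[i:i + FANOUT]],
                  "children": level[i:i + FANOUT]}
                 for i in range(0, len(level), FANOUT)]
    return level[0]

def leaf_depths(node, depth=0):
    if node["children"] is None:
        return [depth]
    return [d for c in node["children"] for d in leaf_depths(c, depth + 1)]

root = bulk_load(list(range(100)))
print(set(leaf_depths(root)))   # a single value: every leaf sits at the same depth
```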
Conference Paper
Full-text available
The relevance of this paper to the topics of the conference is that this paper provides a contribution to the area of systems communications, since this paper participates in the improvement of the Wireless Response Systems (WRS). In more detail, the rapid development of computer and wireless technologies improves many aspects of daily life. The objective of this research is to develop a database for the WRS in order to gain an efficient, fast, and reliable database management system. Furthermore, this research proposes a generic database structure for the Wireless Response System. Moreover, it investigates and studies advanced database indexing techniques and then performs a comparison between them. Subsequently, this work makes an argument to find out the most appropriate indexing technique for the WRS. Consequently, this research has achieved a great deal of success and has met the objectives and aims. To conclude, a framework for the Wireless Response System database has been developed. Additionally, the B+ Tree and Hash indexing techniques have been examined successfully. Thus, it is found that the B+ Tree is a powerful technique for this particular system.
... Relaxing the goal of finding exactly the nearest neighbors allows a nearest-neighbor search to run significantly faster. Many of those methods are based on indexing (Bertino et al., 1997), constructing a multi-dimensional index structure that provides a mapping between a query sample and the ordering on the clustered index values, speeding up the search for neighbors. ...
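A hedged sketch of this index-assisted approximate search: points are grouped by a tiny k-means, and a query scans only the clusters whose centroids are closest instead of the whole data set. The parameters (k, number of probed clusters) are arbitrary illustrations, not values from the cited work.

```python
import random, math

# Illustrative approximate nearest-neighbour search: cluster the points,
# then probe only the clusters nearest to the query.
def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [tuple(sum(c) / len(b) for c in zip(*b)) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids, buckets

def approx_nn(query, centroids, buckets, probe=2):
    # Scan only the `probe` clusters nearest to the query.
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [p for i in order[:probe] for p in buckets[i]]
    return min(candidates, key=lambda p: dist(query, p))

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
centroids, buckets = kmeans(pts, k=10)
print(approx_nn((0.5, 0.5), centroids, buckets))
```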
Article
Full-text available
The k-nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been subject of research for many years and, among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned out as strengths and the k-nearest neighbors rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbor algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis in a series of big datasets that provide guidelines as to how to use the k-nearest neighbor algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed including all the Smart Data algorithms analyzed. This article is categorized under: Technologies > Data Preprocessing; Fundamental Concepts of Data and Knowledge > Big Data Mining; Technologies > Classification.
... The distinguished properties of graph databases include graph storage and graph processing. Some graph databases offer native graph storage, while others serialize graph data into a general-purpose database such as a relational database [25], an object-oriented database [26], or a NoSQL store [27] (other than a graph store). The approach used by graph databases in which adjacent nodes directly point to each other is termed index-free adjacency. ...
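Index-free adjacency can be illustrated with a few lines of Python in which each node object holds direct references to its neighbours, so traversal follows pointers rather than consulting a global edge index; this is a conceptual toy, not any product's storage format.

```python
# Toy illustration of index-free adjacency: each node keeps direct
# references to its neighbours, so traversal never consults an external
# edge index.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []        # direct references to adjacent nodes

    def connect(self, other):
        self.neighbours.append(other)

def neighbours_within(start, hops):
    # Breadth-first traversal that only follows node-to-node pointers.
    seen, frontier = {start}, [start]
    for _ in range(hops):
        frontier = [n for node in frontier for n in node.neighbours if n not in seen]
        seen.update(frontier)
    return {n.name for n in seen}

a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
a.connect(b); b.connect(c); c.connect(d)
print(neighbours_within(a, 2))   # {'A', 'B', 'C'}
```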
Article
Full-text available
Graph is an expressive way to represent dynamic and complex relationships in highly connected data. In today's highly connected world, general-purpose graph databases are providing opportunities to experience the benefits of semantically significant networks without investing in graph infrastructure. Examples of prominent graph databases are Neo4j, Titan, and OrientDB. In the biological OMICS landscape, Interactomics is one of the new disciplines that focuses mainly on the data modeling, data storage, and retrieval of biological interaction data. Biological experiments generate a prodigious amount of data in various formats (semi-structured or unstructured). The large volume of such data poses challenges for data acquisition, data integration, multiple data modalities (either data model or storage model), storage, processing, and visualization. This paper aims at designing a well-suited graph data storage model for biological information collected from major heterogeneous biological data repositories, by using a graph database.
... Nowadays, all database management systems (DBMS) use B-tree indexes or hashing to accelerate data access operations [21]. Using proper indexes on some columns of the table substantially reduces the scope of search. ...
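Using Python's built-in sqlite3 module, one can observe this effect directly: the same query is answered by a full table scan before an index exists and by an index search afterwards. The table and column names below are made up for the demonstration.

```python
import sqlite3

# Demonstration that adding an index changes the access path from a full
# table scan to an index search (SQLite's EXPLAIN QUERY PLAN output).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, dept TEXT, salary INT)")
conn.executemany("INSERT INTO employee (dept, salary) VALUES (?, ?)",
                 [("d%d" % (i % 50), i) for i in range(10000)])

query = "SELECT COUNT(*) FROM employee WHERE dept = 'd7'"
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # full table scan

conn.execute("CREATE INDEX idx_employee_dept ON employee(dept)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # search using the index
```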
... – Linear probing – Indexing (Bertino et al., 2012): to quickly locate data in voluminous datasets ...
Article
Big data is a potential research area receiving considerable attention from academia and IT communities. In the digital world, the amounts of data generated and stored have expanded within a short period of time. Consequently, this fast-growing rate of data has created many challenges. In this paper, we use the structuralism and functionalism paradigms to analyze the origins of big data applications and their current trends. This paper presents a comprehensive discussion of state-of-the-art big data technologies based on batch and stream data processing. Moreover, the strengths and weaknesses of these technologies are analyzed. This study also discusses big data analytics techniques, processing methods, some reported case studies from different vendors, several open research challenges, and the opportunities brought about by big data. The similarities and differences of these techniques and technologies based on important parameters are also investigated. Emerging technologies are recommended as a solution for big data problems.
... Note that we do not build a hash index in our baselines, since we mainly work on relationship tables; each individual column in a relationship table is not a primary key and has many duplicates. For instance, to eliminate a whole-column scan, binary search can be utilized [12, 32] if the column is sorted. Since we adopt the virtual-IDs store strategy, all the columns should be organized in one order. ...
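The binary-search point can be sketched with the standard-library bisect module: if the column values are kept sorted (with a parallel list of row identifiers, an assumption made here for illustration), matching rows are located without scanning the whole column.

```python
import bisect

# Sketch: a sorted column plus a parallel list of row ids lets binary
# search replace a whole-column scan.
sorted_column = [3, 7, 7, 7, 12, 19, 25, 31]      # values, sorted
row_ids       = [5, 0, 2, 9, 1, 4, 8, 3]          # row id of each value

def lookup(value):
    lo = bisect.bisect_left(sorted_column, value)
    hi = bisect.bisect_right(sorted_column, value)
    return row_ids[lo:hi]                          # all rows with that value

print(lookup(7))    # [0, 2, 9] -- found without scanning the whole column
```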
Article
Full-text available
We study a class of graph analytics SQL queries, which we call relationship queries. These queries involving aggregation, join, semijoin, intersection and selection are a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios, where efficient relationship queries are needed. However, row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. We propose a GQ-Fast database, which is an indexed database that roughly corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access. We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2--4 orders of magnitude and MonetDB, Vertica and Neo4j by 1--3 orders of magnitude when all of them are running on RAM. Our experiments dissect GQ-Fast's advantage between (i) the use of compiled code, (ii) the bottom-up pipelining execution strategy, and (iii) the use of dense structures. Other analysis and experiments show the space savings of GQ-Fast due to the appropriate use of compression methods. We also show that the runtime penalty incurred by the dense compression methods decreases as the number of CPU cores increases.
... Hadoop is a typical big data batch computing framework. The HDFS distributed file system is responsible for the storage of static data [10]. The computation logic is assigned to each data node, via MapReduce, for data processing. ...
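The map/shuffle/reduce flow that MapReduce distributes across data nodes can be imitated in a few lines of single-process Python; this word-count toy only shows the shape of the computation, not Hadoop's actual APIs.

```python
from collections import defaultdict

# Minimal, single-process imitation of the MapReduce flow
# (map -> shuffle/group by key -> reduce); the real framework distributes
# these phases across HDFS data nodes.
def map_phase(doc_id, text):
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    return key, sum(values)

docs = {1: "big data needs big storage", 2: "storage and computing move to data"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
print(dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items()))
```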
Conference Paper
Full-text available
Micro-blogging is becoming an important information source for breaking news events. Since micro-blogs form a real-time unbounded stream with complex relationships, traditional burst event detection techniques do not work well. This paper presents RBEDS, a real-time burst event detection system built on the Storm distributed stream processing framework. A K-Means clustering approach and a burst-feature detection approach are each used to identify candidate burst events, and their outputs are combined to generate the final event detection results. This operation is implemented as a Storm topology. The proposed system is evaluated on a large Sina micro-blogging dataset. The achieved system performance shows that RBEDS can detect burst events with good timeliness, effectiveness and scalability.
... To tackle such types of tasks, distributed processing frameworks, e.g. MapReduce in [5], and database management systems (DBMS) in [12] have proved to be more popular. However, many researchers have shown that traditional DBMSs, which use a 'store-then-process' method of computation, cannot provide the acceptable latency needed in real-time stream processing applications [6]. ...
Article
Full-text available
Twitter is an online service that enables users to read and post tweets, thereby providing a wealth of information regarding breaking news stories. The problem of First Story Detection is to identify the first stories about different events from streaming documents. The Locality Sensitive Hashing algorithm is the traditional approach used for First Story Detection. The documents have a high degree of lexical variation, which makes First Story Detection a very difficult task. This work uses Twitter as the data source to address the problem of real-time First Story Detection. As Twitter data contains a lot of spam, we built a dictionary of words to remove spam from the tweets. Further, since the Twitter streaming data rate is high, we cannot use the traditional Locality Sensitive Hashing algorithm to detect the first stories. We modify the Locality Sensitive Hashing algorithm to overcome this limitation while maintaining reasonable accuracy with improved performance. Also, we use the Storm distributed platform, so that the system benefits from the robustness, scalability and efficiency that this framework offers.
... The golden age of spatio-temporal data management systems around the 1990s led to important data management systems, with data models, query languages, indexing and optimization techniques (Bertino et al. 1997). For example, (Tzouramanis et al. 1999) proposed an access method, overlapping linear quadtrees, to store consecutive historical raster images in a database of evolving images and to support query processing. ...
Article
Full-text available
Pervasive computing is all about making information, data, and services available everywhere and anytime. The explosion of huge amounts of data, largely distributed and produced by different means (sensors, devices, networks, analysis processes, more generally data services), and the requirement to have queries processed on the right information, at the right place, at the right time have led to new research challenges for querying. For example, query processing can be done locally in the car, on PDAs or mobile phones, or it can be delegated to a distant server accessible through the Internet. Data and services can therefore be queried and managed by stationary or nomadic devices, using different networks. The main objective of this chapter is to present a general overview of existing approaches to query processing and the authors' vision of query evaluation in pervasive environments. It illustrates, with scenarios and practical examples, existing data and stream querying systems in pervasive environments. It describes the evaluation process of (i) mobile queries and queries on moving objects, (ii) continuous queries and (iii) stream queries. Finally, the chapter introduces the authors' vision of query processing as a service composition in pervasive environments.
... The importance of indexing (Bertino et al., 2012) in databases has been proven, particularly when the data size is extremely large. It improves searching for desired data in large tables and helps in quickly locating data by bypassing the traversal of each and every row. ...
Article
Full-text available
Today, science is passing through an era of transformation, where the inundation of data, dubbed the data deluge, is influencing the decision-making process. Science is driven by data and is being termed data science. In this internet age, the volume of data has grown to petabytes, and this large, complex, structured or unstructured, and heterogeneous data in the form of "Big Data" has gained significant attention. The rapid pace of data growth through various disparate sources, especially social media such as Facebook, has seriously challenged the data analytic capabilities of traditional relational databases. The velocity of the expansion of the amount of data gives rise to a complete paradigm shift in how new-age data is processed. Confidence in the data engineering of the existing data processing systems is gradually fading, whereas the capabilities of the new techniques for capturing, storing, visualizing, and analyzing data are evolving. In this review paper, we discuss some of the modern Big Data models that are leading contributors in the NoSQL era and claim to address Big Data challenges in reliable and efficient ways. Also, we take the potential of Big Data into consideration and try to reshape the original operation-oriented definition of "Big Science" (Furner, 2003) into a new data-driven definition and rephrase it as "The science that deals with Big Data is Big Science."
... Traditional data-intensive tasks involve the batch processing of large static datasets using networks of multiple machines. To tackle these types of tasks, database management systems (DBMS) [5] and distributed processing frameworks, e.g. MapReduce [7], have proved to be popular. ...
Conference Paper
Social media streams, such as Twitter, have shown themselves to be useful sources of real-time information about what is happening in the world. Automatic detection and tracking of events identified in these streams have a variety of real-world applications, e.g. identifying and automatically reporting road accidents for emergency services. However, to be useful, events need to be identified within the stream with a very low latency. This is challenging due to the high volume of posts within these social streams. In this paper, we propose a novel event detection approach that can both effectively detect events within social streams like Twitter and can scale to thousands of posts every second. Through experimentation on a large Twitter dataset, we show that our approach can process the equivalent to the full Twitter Firehose stream, while maintaining event detection accuracy and outperforming an alternative distributed event detection system.
... Fig. 2: Process of generating the feature vector of a moving point and the feature matrix of trajectory data. ...
Article
Full-text available
In this paper, we propose a method of content-based similarity search over trajectory data sets for assisting users in deciding on their best destination. Recently, many similarity search engines for trajectory data have been proposed. However, the algorithms of these similarity search engines only deal with the positions in the trajectory data. Algorithms that use only physical locations for similarity search are not effective, because most users have interests associated with their movements. We believe that data about users' interests while moving are also essential for calculating similarities between trajectory data and a user's best destination. In this paper, we propose a novel method for calculating similarities of trajectory data using textual metadata. We use descriptive documents of the spots where the user stayed. In our proposed method, we use an average function to integrate the user's trajectory data and use the slope of the similarities. Consequently, we confirmed that the system can calculate accurate similarities for trajectory data. We also confirmed the precision of our proposed method for trajectory data.
... In a sense, the STL is largely a calculus of indexes. There are many varieties of index [16], and each provides a different interface [4, 5, 19, 28]. The STL handles this diversity with subclassing; although a part of this interface is shared by all containers, each associative container subclass provides its own unique methods. ...
Article
In relational database systems the index is generally a second-class construct: users cannot explicitly use an index. (In fact, the keyword INDEX is not even defined in the SQL2 (SQL92) standard.) The principle that 'indexes should be used but not seen' has been followed for decades, and is often justified as necessary in order to avoid the complexities introduced by explicit access paths and navigational queries. We review arguments for and against this principle, and for making the index a first-class construct in relational database systems. For large and complex databases, such as those arising in bioinformatics, the second-class status of indexing can be in conflict with its importance. The case for a first-class index appears strongest for situations like these. We investigate ways to incorporate first-class indexing into the relational database model, surfacing indexes as functionals. This investigation gives insights about the relational model, and also suggests ways for relational databases to support applications like bioinformatics, for which indexing is of central importance.
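One plausible reading of "indexes as functionals" is an index exposed as an explicit callable from key values to matching rows; the small Python sketch below is only an interpretation for illustration, with invented data, not the paper's actual proposal.

```python
# Sketch of an index surfaced as a first-class, callable object mapping a
# key value to the rows that match it, instead of being hidden behind the
# query optimizer. Data and names are invented for the example.
class FirstClassIndex:
    def __init__(self, rows, key):
        self._map = {}
        for row in rows:
            self._map.setdefault(key(row), []).append(row)

    def __call__(self, value):
        # The index is used explicitly, like a function from key to rows.
        return self._map.get(value, [])

genes = [{"id": "g1", "chrom": "chr1"}, {"id": "g2", "chrom": "chr2"},
         {"id": "g3", "chrom": "chr1"}]
by_chrom = FirstClassIndex(genes, key=lambda r: r["chrom"])
print([g["id"] for g in by_chrom("chr1")])   # ['g1', 'g3']
```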
... BitCube [24]). Indexing the structure of XML documents is related to indexing aggregation graphs and inheritance hierarchies in object-oriented databases [4]. Indices on aggregation graphs allow for indexing objects on values of nested objects. ...
Article
To cope with the increasing number and size of XML documents, XML databases provide index structures to accelerate queries on the content and structure of documents. To adapt indices to the query workload, XML databases require various secondary index structures. This paper presents a generic index framework called sciens (Structure and Content Indexing with Extensible, Nestable Structures). In contrast to existing work on XML indexing, this framework can integrate arbitrary index structures and adapt them to different query requirements. It supports defining, accessing and maintaining indices without affecting query and update processing. By offering great flexibility in what to index, the framework allows for processing queries more efficiently.
Book
Full-text available
This book presents the peer-reviewed proceedings of the 2nd International Conference on Computational and Bioengineering (CBE 2020) jointly organized in virtual mode by the Department of Computer Science and the Department of BioScience & Sericulture, Sri Padmavati Mahila Visvavidyalayam (Women's University), Tirupati, Andhra Pradesh, India, during 4–5 December 2020. The book includes the latest research on advanced computational methodologies such as artificial intelligence, data mining and data warehousing, cloud computing, computational intelligence, soft computing, image processing, Internet of things, cognitive computing, wireless networks, social networks, big data analytics, machine learning, network security, computer networks and communications, bioinformatics, biocomputing/biometrics, computational biology, biomaterials, bioengineering, and medical and biomedical informatics.
Article
Frequent pattern mining is an essential data-mining task, with the goal of discovering knowledge in the form of repeated patterns. Many efficient pattern-mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called "Big Data". Scalable parallel algorithms hold the key to solving the problem in this context. This paper reviews recent advances in parallel frequent pattern mining, analysing them through the Big Data lens. Load balancing and work partitioning are the major challenges to be conquered, and they continually call for innovative methods as Big Data evolves without limits. The biggest challenge of all is conquering unstructured data to find frequent patterns. To accomplish this, a Semi-Structured Doc-Model and ranking of patterns are used.
Chapter
In this chapter, the authors discuss two important trends in modern software engineering (SE) regarding the utilization of knowledge management (KM) and information retrieval (IR). Software engineering is a discipline in which knowledge and experience, acquired in the course of many years, play a fundamental role. For software development organizations, the main assets are not manufacturing plants, buildings, and machines, but the knowledge held by their employees. Software engineering has long recognized the need for managing knowledge and that the SE community could learn much from the KM community. The authors introduce the fundamental concepts of KM theory and practice and mainly discuss the aspects of knowledge management that are valuable to software development organizations and how a KM system for such an organization can be implemented. In addition to knowledge management, information retrieval (IR) also plays a crucial role in SE. IR is a study of how to efficiently and effectively retrieve a required piece of information from a large corpus of storage entities such as documents. As software development organizations grow larger and have to deal with larger numbers (probably millions) of documents of various types, IR becomes an essential tool for retrieving any piece of information that a software developer wants within a short time. IR can be used both as a general-purpose tool to improve the productivity of developers or as an enabler tool to facilitate a KM system.
Thesis
The task of exploring unexploited but newly digitized resources in order to find relevant information is made more complex by the quantity of resources available. Thanks to the ANR CIRESFI project, the most important resource for the eighteenth-century Comédie-Italienne is a set of account registers comprising 28,000 pages. Information extraction is a long and complex process that requires expertise at every step: detection and segmentation, feature extraction, and handwriting recognition. Systems based on deep neural networks dominate all of these approaches. Their major problem is that they require a large quantity of data for training. However, the Comédie-Italienne registers have no ground truth. To compensate for this lack of data, we explore approaches that can perform transfer learning, that is, using an already labelled, available dataset sharing a minimum of common features with our data to train the systems, and then applying them to our data. All of our experiments showed the difficulty of this task, since each choice at each step has a strong impact on the rest of the system. We converge towards a solution that separates the optical model from the language model so that they can be trained independently on the different types of available resources and then combined by projecting all the information into a common non-latent space.
Chapter
Multi-dimensional data indexing has received much research attention recently in a centralized system. However, it remains a nascent area of research in providing an integrated structure for multiple queries on multi-dimensional data in a distributed environment. We propose a new data structure, called BR-tree (Bloom filter based R-tree), and implement such a prototype in the context of a distributed system. The node in a BR-tree, viewed as an expansion from the traditional R-tree node structure, incorporates space-efficient Bloom filters to facilitate fast membership queries. The proposed BR-tree can simultaneously support not only existing point and range queries but also cover and bound queries that can potentially benefit various data indexing services. Compared with previous data structures, BR-tree achieves space efficiency and provides quick response (≤ O(log n)) on these four types of queries. Our extensive experiments in a distributed environment further validate the practicality and efficiency of the proposed BR-tree structure (© 2009 IEEE. Reprinted, with permission, from Ref. [1].).
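The idea of attaching a Bloom filter to a tree node so that membership queries can be answered without descending further can be sketched as follows; the filter size, hash scheme, and node layout are simplified stand-ins rather than the BR-tree's actual design.

```python
import hashlib

# Sketch: a space-efficient Bloom filter attached to a tree node answers
# "does this node's subtree possibly contain the key?" membership queries.
class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

class TreeNode:
    def __init__(self, keys):
        self.keys = keys
        self.bloom = BloomFilter()
        for k in keys:
            self.bloom.add(k)

node = TreeNode(["obj-17", "obj-42", "obj-99"])
print(node.bloom.might_contain("obj-42"))   # True
print(node.bloom.might_contain("obj-1"))    # False (with high probability)
```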
Article
A spatial-temporal data structure, called the PMD-tree (Persistent Multi-Dimensional tree), has been proposed for managing the live intervals and locations of spatial objects. In this paper, the novel concepts of the time space bounding box (TSBB) and the motion list are introduced into the PMD-tree to manage moving spatial objects efficiently. A TSBB is an extended bounding box for a moving object that covers the trajectory of the object. As an object moves, the TSBB corresponding to the object is enlarged to enclose the trajectory of the object. A TSBB is divided when it becomes greater than a limit. An object and its corresponding TSBBs are managed by a doubly connected linked list, called the motion list. TSBBs are also managed by the PMD-tree. By introducing the concepts of the TSBB and the motion list into the PMD-tree, moving objects can be managed efficiently and found quickly for spatial-temporal queries. Through a series of simulation tests, the storage requirements and search performance are evaluated for several types of moving objects. As a result, the proposed method is shown to be superior to the conventional methods.
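A rough sketch of the TSBB and motion-list idea, with an invented extent threshold: each box is enlarged as the object moves, and a new box is appended to the list once the current one grows beyond the limit.

```python
# Illustrative time-space bounding box (TSBB): it is enlarged to enclose a
# moving object's trajectory and a new box is started once its spatial
# extent exceeds a limit, with the boxes chained in a "motion list".
LIMIT = 5.0   # maximum allowed spatial extent of one box (arbitrary)

class TSBB:
    def __init__(self, t, x, y):
        self.t0 = self.t1 = t
        self.xmin = self.xmax = x
        self.ymin = self.ymax = y

    def extent(self):
        return max(self.xmax - self.xmin, self.ymax - self.ymin)

    def enlarge(self, t, x, y):
        self.t1 = t
        self.xmin, self.xmax = min(self.xmin, x), max(self.xmax, x)
        self.ymin, self.ymax = min(self.ymin, y), max(self.ymax, y)

def index_trajectory(samples):
    motion_list = [TSBB(*samples[0])]
    for t, x, y in samples[1:]:
        box = motion_list[-1]
        box.enlarge(t, x, y)
        if box.extent() > LIMIT:          # box grew too big: start a new one
            motion_list.append(TSBB(t, x, y))
    return motion_list

track = [(0, 0, 0), (1, 2, 1), (2, 4, 3), (3, 7, 4), (4, 8, 6), (5, 9, 9)]
boxes = index_trajectory(track)
print(len(boxes), [(b.t0, b.t1) for b in boxes])
```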
Conference Paper
Owing to rapid economic and social development in China, tens of thousands of construction projects are launched each year. During the construction and operation periods, projects will, to a greater or lesser extent, damage or impact the environment of the project site and the surrounding area. In China, project-level environmental impact assessment (PEIA) has been carried out for more than ten years under a state law in order to prevent or mitigate the environmental impact of projects. PEIA is data-driven, or data-intensive, research work. On the one hand, it needs the support of large amounts of data. On the other hand, PEIA produces abundant data and documents. It is therefore urgent to build a PEIA database to standardize, integrate, and preserve these data in the long term, and thus to promote their wide sharing and usage. For successful PEIA database building, database design is a crucial issue that decides what data elements and attributes should be included and thus influences the range of application of the database. With traditional database design measures, it is hard to identify all domain entities or objects and to represent the complex semantic relationships behind them. In this paper, we propose an ontology-based design method for the PEIA database and a method for mapping the ontology to the Entity-Relationship (ER) model. Based on the ontology, all concepts of PEIA and their attributes, as well as their semantic relationships, can be clearly and completely designed and transformed into the ER model. By this method, we have built the National PEIA database of China (NPEIA), which has integrated large amounts of basic supporting datasets, such as geo-spatial data and environmental sensitive-area data, and more than 100,000 PEIA records covering projects from 13 different industries. With the development of big data mining and linked open data, in the future we will focus on enriching and opening the PEIA database and linking it with SEA (strategic environmental assessment) and PEA (planning environmental assessment). Deep data mining and analysis based on NPEIA and other related datasets are also our research emphasis, to support national industrial structural adjustment and total-amount control of pollution.
Chapter
Object-oriented and multimedia databases, which must manipulate multi-valued attributes and complex objects in general, raise problems with particular properties and needs, such as the constrained domain of set-valued attributes (in contrast to the unconstrained domain of text databases) and the superset, perfect-match and subset queries that are frequently asked. The focus of this chapter is on the advantages that can arise from a hybrid structure that handles set-valued attributes in the form of signatures, based on hashing and the S-tree.
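As a rough illustration of the signature approach such structures build on, the sketch below uses superimposed coding: each element sets a few bits, a set's signature is the bitwise OR of its elements' signatures, and bitwise tests act as quick (false-positive-prone) filters for subset, superset and perfect-match queries, to be verified afterwards against the actual sets. All names and parameters (SIG_BITS, BITS_PER_ELEMENT) are invented for this example.

```python
import hashlib

SIG_BITS = 64            # signature length in bits
BITS_PER_ELEMENT = 3     # bits set per element (superimposed coding)

def element_signature(element):
    sig = 0
    for i in range(BITS_PER_ELEMENT):
        digest = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
        sig |= 1 << (int(digest, 16) % SIG_BITS)
    return sig

def set_signature(values):
    sig = 0
    for value in values:
        sig |= element_signature(value)   # superimpose element signatures
    return sig

def maybe_subset(query_sig, stored_sig):
    """Filter for subset queries: the query set may be contained in the stored set."""
    return query_sig & stored_sig == query_sig

def maybe_superset(query_sig, stored_sig):
    """Filter for superset queries: the query set may contain the stored set."""
    return query_sig & stored_sig == stored_sig

def maybe_equal(query_sig, stored_sig):
    """Filter for perfect-match queries."""
    return query_sig == stored_sig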
Chapter
Sections: Introduction; Data Model and System Overview; Replication and Load Balancing; Evaluation; Related Work; Summary; References.
Article
The volume of data in the world has grown enormously, and with the advance of data-intensive applications there is a need to collect, analyze, process, and retrieve enormous datasets efficiently. Such large datasets are popularly termed "Big Data," a term coined by Roger Magoulas, director of market research at O'Reilly Media. Data scientists around the world have developed a range of approaches to deal with these datasets, and scalable implementations of information retrieval (IR) operations have become a necessity. The MapReduce [1, 2] programming model (and Apache Hadoop [3], an open-source implementation of MapReduce) has emerged as a very effective tool for handling large volumes of data in a distributed environment. In this work we extend the single-pass indexing technique for large data with a hash-based implementation over the MapReduce framework.
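The following toy, single-machine sketch shows the general shape of inverted-index construction in the MapReduce style referred to above: map() emits (term, document id) pairs and reduce() merges the pairs for each term into a posting list. Function names and the in-memory "shuffle" are illustrative stand-ins; a real deployment would run these as Hadoop map and reduce tasks.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # emit one (term, doc_id) pair per token
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(term, doc_ids):
    # posting list: sorted, de-duplicated document ids for one term
    return term, sorted(set(doc_ids))

def build_index(documents):
    grouped = defaultdict(list)           # stands in for the shuffle/sort step
    for doc_id, text in documents.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)
    return dict(reduce_phase(t, ids) for t, ids in grouped.items())

if __name__ == "__main__":
    docs = {1: "big data needs scalable indexing",
            2: "indexing big datasets with mapreduce"}
    print(build_index(docs))   # e.g. {'big': [1, 2], 'indexing': [1, 2], ...}
```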
Article
In this book chapter, the authors discuss two important trends in modern software engineering (SE): the use of knowledge management (KM) and of information retrieval (IR). Software engineering is a discipline in which knowledge and experience, acquired over many years, play a fundamental role. For software development organizations, the main assets are not manufacturing plants, buildings, and machines, but the knowledge held by their employees. Software engineering has long recognized the need for managing knowledge, and the SE community could learn much from the KM community. The authors introduce the fundamental concepts of KM theory and practice, discussing in particular the aspects of knowledge management that are valuable to software development organizations and how a KM system for such an organization can be implemented. In addition to knowledge management, information retrieval also plays a crucial role in SE. IR is the study of how to efficiently and effectively retrieve a required piece of information from a large corpus of storage entities such as documents. As software development organizations grow larger and have to deal with large numbers (possibly millions) of documents of various types, IR becomes an essential tool for retrieving any piece of information a software developer needs within a short time. IR can be used either as a general-purpose tool to improve developer productivity or as an enabler of a KM system.
Article
In this work we developed FindMeDIA, whose purpose is to preserve Moroccan cultural heritage in the form of movies and photographs. This required managing a joint image and video database. Since a video can be viewed as a sequence of still images, we treat video as an extension of image modelling in a quasi-transparent way. Our main objective was, on the one hand, to meet genericity and flexibility requirements so that users can navigate over different types of visual data and, on the other hand, to provide a system in which users can move seamlessly between images and videos. For video modelling, we propose FindViDEO; in both its model and metamodel parts, FindViDEO is flexible and covers a large spectrum of applications and pre-existing models. For navigation, we reuse a Galois lattice technique on a database composed of still images and key frames extracted from videos. The resulting FindMeDIA system is generic and can use many image description techniques for navigation. To assess the interest of these approaches, the modelling of key frames extracted from videos, as well as of still images, is carried out with ClickImAGE, which provides a semi-structured, content-based representation of images.
Article
Recent trends in information technology have resulted in a massive proliferation of data that is carried over different kinds of networks, produced in either on-demand or streaming fashion, generated and accessed by a variety of devices, and possibly subject to mobility. This thesis presents an approach for evaluating hybrid queries that integrate the various aspects involved in querying continuous, mobile and hidden data in dynamic environments. Our approach represents such a hybrid query as a service coordination comprising data and computation services. A service coordination is specified by a query workflow and additional operator workflows. A query workflow represents an expression built with the operators of our data model; it is constructed from a query written in our proposed SQL-like language, HSQL, by an algorithm we developed based on known results from database theory. Operator workflows compose computation services in order to evaluate a particular operator. HYPATIA, a service-based hybrid query processor, implements and validates our approach.
Conference Paper
Indexing is important for fast query response in information retrieval, and supporting multiple query types over multidimensional data is a challenging task that has recently received much attention. In this paper a new data structure, the Perfect Hash Base R-tree (PHR-tree), is proposed. A PHR-tree node extends the traditional R-tree node with a perfect hashing index so that multiple query types can be supported efficiently. The structure supports point queries on multidimensional data efficiently, and it provides space efficiency and fast response (O(log n)) for all query types.
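A simplified sketch of the idea, with invented names: each node of an R-tree-like structure keeps, alongside its entries, a hash index over the points stored in its subtree, so that exact point queries can be answered without descending through overlapping rectangles. A Python dict stands in for the perfect hash index described in the paper, and rectangles and range queries are omitted.

```python
class PHRTreeNode:
    def __init__(self, children=None, points=None):
        self.children = children or []          # internal node: child nodes
        self.points = points or []              # leaf node: (x, y) tuples
        # hash index over every point stored in this subtree
        self.hash_index = {p: True for p in self.points}
        for child in self.children:
            self.hash_index.update(child.hash_index)

def point_query(node, point):
    # constant-time membership test at the node instead of an R-tree descent
    return point in node.hash_index

leaf_a = PHRTreeNode(points=[(1, 2), (3, 4)])
leaf_b = PHRTreeNode(points=[(5, 6)])
root = PHRTreeNode(children=[leaf_a, leaf_b])
assert point_query(root, (3, 4)) and not point_query(root, (7, 8))
```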
Article
We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, generalizing classical and recent worst-case algorithmic results on computing joins. In addition, we use the framework and the same algorithm to show a series of what are colloquially known as beyond-worst-case results. The framework allows us to prove results for data stored in B-trees, in multidimensional data structures, and even in multiple indices per table. A key idea is to formalize the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins into a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.
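To make the "backtracking search with memoization" view concrete, here is a small, generic sketch (an illustration, not the paper's algorithm): attributes are bound one at a time, a partial binding is kept only while every relation still has a matching tuple, and failed subproblems are memoized on the part of the binding that can still influence the remaining search. All function names are invented.

```python
def natural_join(relations, schemas, attr_order):
    """relations: list of tuple lists; schemas: parallel lists of attribute names."""
    results, failed = [], set()

    def compatible(binding):
        # every relation must still contain a tuple agreeing with the binding
        return all(
            any(all(binding.get(a, v) == v for a, v in zip(schema, t)) for t in rel)
            for rel, schema in zip(relations, schemas)
        )

    def memo_key(i, binding):
        # only attributes co-occurring with a still-unbound attribute matter
        remaining = set(attr_order[i:])
        relevant = {a for schema in schemas if remaining & set(schema)
                    for a in schema if a in binding}
        return i, tuple(sorted((a, binding[a]) for a in relevant))

    def search(i, binding):
        if i == len(attr_order):
            results.append(dict(binding))
            return
        key = memo_key(i, binding)
        if key in failed:
            return                                    # memoized dead end
        produced = len(results)
        attr = attr_order[i]
        candidates = {t[schema.index(attr)]
                      for rel, schema in zip(relations, schemas) if attr in schema
                      for t in rel}
        for value in candidates:
            binding[attr] = value
            if compatible(binding):
                search(i + 1, binding)
            del binding[attr]
        if len(results) == produced:
            failed.add(key)                           # remember the failure

    search(0, {})
    return results

R = [(1, "a"), (2, "b")]                              # R(x, y)
S = [("a", 10), ("a", 20)]                            # S(y, z)
print(natural_join([R, S], [["x", "y"], ["y", "z"]], ["x", "y", "z"]))
```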
Article
Determining fracture network connections helps investigators remove "meaningless" fractures that pass no flow, yielding an updated and more effective fracture network that can considerably improve computational efficiency in numerical simulations of fluid flow and solute transport. Efficient algorithms are needed to accomplish this task in large-scale fractured rock masses. We propose a new approach based on R-tree indexing for determining fracture connections in three-dimensional, stochastically distributed fracture networks. Compared with the traditional exhaustion algorithm, the simulation results show that this approach is much more effective, and its advantage grows as more fractures are investigated. The results also indicate that creating the R-tree index accounts for a major part of the total runtime, which comprises four steps: calculating minimum bounding rectangles (MBRs), creating the R-tree index, finding fracture intersections precisely, and identifying flow paths. The proposed approach for determining fracture connections in three-dimensional fractured rocks is expected to provide efficient preprocessing and a critical database for numerical computation of fluid flow and solute transport in large-scale fractured rock masses.
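The sketch below condenses the filter-and-refine pattern behind such an approach: each fracture is reduced to its minimum bounding rectangle, only pairs with overlapping MBRs are passed to the exact (and expensive) intersection test, and in the paper an R-tree over the MBRs is what finds those overlapping pairs without testing every pair as the exhaustion algorithm does. All names and the 2D simplification are illustrative.

```python
def mbr(points):
    """Minimum bounding rectangle of a fracture given as a list of (x, y) points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_pairs(fractures):
    boxes = [mbr(f) for f in fractures]
    # an R-tree built over `boxes` would answer this overlap search without
    # the all-pairs loop used here for brevity
    return [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
            if mbrs_overlap(boxes[i], boxes[j])]

def connected_pairs(fractures, exact_intersection_test):
    """Refine the MBR candidates with the exact geometric intersection test."""
    return [(i, j) for i, j in candidate_pairs(fractures)
            if exact_intersection_test(fractures[i], fractures[j])]
```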
Conference Paper
Different models have been proposed recently for representing temporal data, tracking historical information, and retrieving the results of temporal queries efficiently. We consider the problem of indexing temporal XML documents. In particular, we propose an indexing scheme that uses a summary structure and a matrix capturing the structural relationships as well as the time intervals inside a temporal XML document. We introduce an algorithm that uses the proposed index to efficiently process all types of temporal queries at any depth. We show that our index outperforms state-of-the-art indices in both query processing time and support for different temporal query types.
Article
A major performance goal of a DBMS is to minimize the number of I/Os (i.e., blocks or pages transferred) between the disk and main memory. One way to achieve this goal is to minimize the number of I/Os needed to answer a query. Many queries reference only a small portion of the records in a database file. For example, the query "find the employees who reside in Santa Monica" references only a fraction of the records in the Employee relation. It would be very inefficient to have the database system sequentially read all the pages of the Employee file and check the residence field of each employee record for the name "Santa Monica"; instead, the system should be able to locate the pages with "Santa Monica" employee records directly. To allow such fast access, additional data structures called access methods are designed per database file. There are two fundamental access methods, namely indexing and hashing. The most widely used indexing scheme is the B+-tree. Hashing is also common, particularly in its Extendible and Linear Hashing variants. We also describe two multi-attribute access methods, the k-d tree and the Grid File. Finally, we discuss an approach that is popular for document searching, the Inverted File.
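A toy illustration of the access-method idea in the "Santa Monica" example: a hash-based secondary index on the residence attribute maps each city to the record identifiers of the matching employees, so the query touches only those records instead of scanning the whole file. The data and names are invented for this sketch.

```python
from collections import defaultdict

employees = {                              # record id -> (name, residence)
    1: ("Alice", "Santa Monica"),
    2: ("Bob", "Pasadena"),
    3: ("Carol", "Santa Monica"),
}

residence_index = defaultdict(list)        # secondary index on the residence field
for rid, (_, city) in employees.items():
    residence_index[city].append(rid)

def employees_in(city):
    # fetch only the records whose ids appear in the index entry for `city`
    return [employees[rid] for rid in residence_index.get(city, [])]

print(employees_in("Santa Monica"))        # reads only the two matching records
```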
Article
LH*RS is a high-availability scalable distributed data structure (SDDS). An LH*RS file is hash-partitioned over the distributed RAM of a multicomputer, for example a network of PCs, and tolerates the unavailability of any k ≥ 1 of its server nodes. The value of k transparently grows with the file to offset the decline in reliability. Only the number of storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure-correcting coding. The resulting parity storage overhead is about the lowest possible, and the parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives for data-intensive applications, including the emerging ones of grid and P2P computing.
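The following is a deliberately simplified, single-parity sketch of the encode-and-recover pattern behind such parity schemes: with one parity bucket, the XOR of the data buckets is enough to rebuild any one unavailable bucket. LH*RS itself uses Reed-Solomon erasure coding so that any k ≥ 1 failures can be tolerated; this k = 1 illustration, with invented names, only shows the pattern.

```python
def xor_bytes(blocks):
    """Bytewise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode_group(data_buckets):
    """Return the parity bucket for a group of equally sized data buckets."""
    return xor_bytes(data_buckets)

def recover_bucket(surviving_buckets, parity):
    """Rebuild the single missing data bucket from the survivors and the parity."""
    return xor_bytes(surviving_buckets + [parity])

group = [b"record-a", b"record-b", b"record-c"]
parity = encode_group(group)
assert recover_bucket([group[0], group[2]], parity) == group[1]
```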
Article
With the rapid rise in site-specific data collection, many research efforts have been directed towards finding optimal sampling and analysis procedures. However, the absence of widely available, high-quality precision agriculture data sets makes it difficult to compare results from separate experiments and to assess the optimality and applicability of procedures. To provide a tool for spatial data experimentation, we have developed a spatial data generator that allows users to produce data layers with given spatial properties and a response variable (e.g. crop yield) dependent upon user-specified functions. Differences in response functions within fields can be simulated by assigning different models to regions in coordinate space (x and y) or in feature space (the multidimensional space of attributes that may influence the response). Noise, whether unexplained variance or sensor error, can be added to all spatial layers. Sampling and interpolation error is modeled by sampling a continuous data layer and interpolating values at unsampled locations. The program has been successfully tested for up to 15,000 grid points, 10 features and 5 models. As an illustration of the potential uses of generated data, the effect of sampling density and kriging interpolation on neural network prediction of crop yield was assessed. Yield prediction accuracy was highly related (correlation coefficient 0.98) to the accuracy of the interpolated layers, indicating that unless data are sampled at very high densities relative to their geostatistical properties, one should not attempt to build highly accurate regression functions from interpolated data. By allowing users to generate large amounts of data with controlled complexity and features, the spatial data generator should facilitate the development of improved sampling and analysis procedures for spatial data.
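A minimal sketch, with invented feature names and model coefficients, of the kind of layer generation described: grid coordinates, two feature layers, a response computed from a region-dependent model, and additive Gaussian noise representing unexplained variance.

```python
import math
import random

def generate_field(nx=50, ny=50, noise_sd=0.2, seed=42):
    rng = random.Random(seed)
    rows = []
    for i in range(nx):
        for j in range(ny):
            x, y = i / nx, j / ny
            nitrogen = 0.5 + 0.5 * math.sin(6.0 * x)        # feature layer 1
            moisture = 0.5 + 0.5 * math.cos(4.0 * y)        # feature layer 2
            # different response model in the western and eastern halves of the field
            if x < 0.5:
                response = 2.0 * nitrogen + 1.0 * moisture
            else:
                response = 0.5 * nitrogen + 2.5 * moisture
            response += rng.gauss(0.0, noise_sd)            # unexplained variance
            rows.append((x, y, nitrogen, moisture, response))
    return rows

field = generate_field()
print(len(field), field[0])     # 2500 grid points; one (x, y, features, yield) row
```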
Article
Providing efficient query processing in database systems is one step towards gaining acceptance of such systems by end users. We propose several techniques for indexing fuzzy sets in databases to improve the query evaluation performance. Three of the presented access methods are based on superimposed coding, while the fourth relies on inverted files. The efficiency of these techniques was evaluated experimentally. We present results from these experiments, which clearly show the superiority of the inverted files.
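As an illustration of the inverted-file option (not the paper's implementation), the sketch below maps each domain element to the records whose fuzzy set contains it, together with the membership grade, so that a query such as "records where 'red' has grade at least 0.7" reads one posting list instead of scanning all records. Names and data are invented.

```python
from collections import defaultdict

def build_inverted_file(records):
    """records: dict of record id -> fuzzy set given as {element: membership grade}."""
    postings = defaultdict(list)
    for rid, fuzzy_set in records.items():
        for element, grade in fuzzy_set.items():
            postings[element].append((rid, grade))
    return postings

def query(postings, element, min_grade):
    # return ids of records whose fuzzy set contains `element` with grade >= min_grade
    return [rid for rid, grade in postings.get(element, []) if grade >= min_grade]

records = {1: {"red": 0.9, "blue": 0.2}, 2: {"red": 0.6}, 3: {"green": 1.0}}
inv = build_inverted_file(records)
print(query(inv, "red", 0.7))      # -> [1]
```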