Figure 3: Architecture of the PGQP System

Source publication
Article
Full-text available
Graphs, being expressive data structures, have become increasingly important for modeling real-world applications, such as collaborations, different kinds of transactions, and social networks, to name a few. With the advent of social networks and the web, graph sizes have grown too large to fit in main memory, precipitating the need for alternative...

Contexts in source publication

Context 1
... to the PGQP system is still a graph database and one or more queries. The QP-Subdue architecture has been extended (shown in Figure 3) to accept a partition with additional cut-set information (instead of the whole graph) and to process a query from a set of specified starting points. Our approach uses files for communication between iterations. ...
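The file-based hand-off between iterations can be pictured with a short sketch. The following is our reconstruction, not the system's actual format: record fields such as `resume_vertex` and `target_partition` are hypothetical, chosen only to illustrate what a continuation record would need to carry.

```python
import json

# A sketch of a continuation record written at the end of an iteration.
# The field names are hypothetical; they illustrate what has to be carried
# across partitions: the partial match so far, the cut edge's far
# endpoint, and the partition that owns it.
def write_continuations(path, records):
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_continuations(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

pending = [
    # A partial match in partition 0 stopped at cut edge (17 -> 42);
    # vertex 42 lives in partition 3, so the next iteration resumes there.
    {"query_id": "Q1", "matched_vertices": [5, 9, 17],
     "resume_vertex": 42, "target_partition": 3},
]
write_continuations("iter_001.cont", pending)
print(read_continuations("iter_001.cont"))
```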
Context 2
... At the end of an iteration (which corresponds to processing one or more partitions independently), appropriate information is written for each partition so that query processing can continue using other partitions. The shaded modules in Figure 3 are extensions for the partitioned approach. The non-shaded modules correspond to pre-existing systems/modules that are used, such as METIS/KaHIP, the catalog generator, and the plan generator. ...
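At a higher level, the loop this context describes might look like the following toy simulation. This is a sketch under simplifying assumptions: a fixed two-way partition and a stub `process_partition`; in the real system the partitioning comes from METIS/KaHIP, partitions are processed in parallel, and continuation records travel through files between iterations.

```python
# Toy stand-ins: two partitions; processing partition 0 hands one match
# over to partition 1, which then finishes it locally.
def process_partition(pid, work):
    if pid == 0 and work:
        return [{"query_id": "Q1", "target_partition": 1}]
    return []  # everything finished locally, nothing to hand over

pending = {0: ["Q1"], 1: []}  # queries start in partition 0
iteration = 0
while any(pending.values()):
    iteration += 1
    carried = {p: [] for p in pending}
    for pid, work in pending.items():
        # Independent calls: the real system runs these in parallel and
        # exchanges continuation records through files between iterations.
        for cont in process_partition(pid, work):
            carried[cont["target_partition"]].append(cont["query_id"])
    pending = carried
print("converged after", iteration, "iterations")  # 2 with this toy input
```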
Context 3
... The important aspects of implementing the PGQP system (whose architecture is shown in Figure 3) are briefly summarized in this section. For additional details, refer to [1]. ...

Similar publications

Article
Full-text available
Database outsourcing is a challenge concerning data secrecy. Even if an adversary, including the service provider, accesses the data, she should not be able to learn any information from the accessed data. In this paper, we address this problem for graph-structured data. First, we define a secrecy notion for graph-structured data based on the conce...

Citations

... Hence, these quality-aware fuzzy queries are put forward so that entering preferences is much more user-friendly and intuitive. Another study suggested resolutions to trade-offs in response time and scalability [7]. The proponents used a divide-and-conquer approach based on graph partitions and identified quantitative metrics for formulating heuristics that regulate the partition processing sequence for query-processing efficiency. ...
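As an illustration of the kind of heuristic this citation alludes to, a partition-ordering score might combine cheap per-partition metrics. The metrics and weighting below are our assumptions, not the formula from [7].

```python
# A hypothetical partition-ordering heuristic: prefer partitions with many
# query starting points and few cut edges, so more of the query resolves
# without crossing into other partitions.
def partition_order(partitions):
    """Return partition ids, most promising first."""
    def score(p):
        return p["start_vertices"] - 0.5 * p["cut_edges"]
    return [p["id"] for p in sorted(partitions, key=score, reverse=True)]

parts = [
    {"id": 0, "start_vertices": 12, "cut_edges": 40},
    {"id": 1, "start_vertices": 3,  "cut_edges": 5},
    {"id": 2, "start_vertices": 9,  "cut_edges": 8},
]
print(partition_order(parts))  # -> [2, 1, 0]
```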
... However, with the emergence of data sets with relationships among entities and complex application requirements, such as shortest paths, important neighborhoods, dominant nodes (or groups of nodes), etc. [54, 43], the relational data model was not the best choice for modeling or analyzing them [38]. This led to the evolution of NoSQL data models, including the graph data model [29]. ...
... Advantages. Attribute graphs have been successfully used in subgraph mining [41, 58, 53], querying [54, 43, 42], and searching [52] over multi-entity, multi-feature data sets. They capture more semantic information than simple graphs and can handle multiple types of both features and entities. ...
Preprint
Full-text available
Any large complex data analysis to infer or discover meaningful information/knowledge involves the following steps (in addition to data collection, cleaning, and preparing the data for analysis, such as attribute elimination): i) modeling the data: an approach for modeling and deriving a data representation for analysis using that approach; ii) translating analysis objectives into computations on the model generated; this can be as simple as a single computation (e.g., community detection) or may involve a sequence of operations (e.g., pair-wise community detection over multiple networks) using expressions based on the model; iii) computation of the expressions generated, where efficiency and scalability come into the picture; and iv) drill-down of results to interpret or understand them clearly. Beyond this, it is also meaningful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this. This paper covers all of the above steps of the data analysis life cycle using a data representation that is gaining importance for multi-entity, multi-feature data sets: Multilayer Networks (MLNs). We use several data sets to establish the effectiveness of modeling using MLNs and analyze them using the proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are US commercial airlines, IMDb, DBLP, and a Covid-19 data set. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. We demonstrate the drill-down that is afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.
... However, with the emergence of structured data sets with inherent relationships among entities and complex application requirements, such as shortest paths, important neighborhoods, dominant nodes (or groups of nodes), etc. [7, 12], the relational data model was not the best choice for modeling or analyzing them [5]. This led to the evolution of NoSQL data models, including the graph data model [3]. ...
Chapter
Extended Entity Relationship (or EER) modeling is an important step after application requirements for data analysis are gathered, and is critical for translating user requirements to a given executable data model (e.g., relational or, for this paper, Multilayer Networks or MLNs). EER modeling provides a more precise understanding of the application and data requirements and an unambiguous representation from which the data model (on which analysis is performed) can be generated algorithmically. EER has played a central role in modeling user-level requirements for relational, object-oriented, and other data models. UML, whose roots are in EER modeling, is extensively used in industry.
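A toy sketch of the algorithmic generation step this chapter describes, under heavily simplified assumptions: each entity type becomes a layer and each relationship becomes intra- or inter-layer edges. The dictionary encoding of the EER input is entirely illustrative, not the chapter's actual representation.

```python
# Hypothetical simplified EER description: entity types plus binary
# relationships between them.
eer = {
    "entities": ["Author", "Paper"],
    "relationships": [
        {"name": "coauthors", "between": ("Author", "Author")},  # intra-layer
        {"name": "writes",    "between": ("Author", "Paper")},   # inter-layer
    ],
}

# Derive an MLN skeleton: one layer per entity type; relationships within a
# type become intra-layer edge types, across types inter-layer edge types.
mln = {"layers": list(eer["entities"]), "intra": [], "inter": []}
for rel in eer["relationships"]:
    a, b = rel["between"]
    kind = "intra" if a == b else "inter"
    mln[kind].append({"layer_pair": (a, b), "edge_type": rel["name"]})
print(mln)
```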
... • The second article [2], whose title is "Query Processing on Large Graphs: Approaches to Scalability and Response Time Trade Offs", studies parallel processing of queries in a distributed system. This paper proposes to partition a graph into subgraphs for parallel processing instead of redistributing edges by vertex. ...
Article
Full-text available
Graph problems are significantly harder to solve with large graphs residing on disk compared to main memory only. In this work, we study how to solve four important graph problems: reachability from a source vertex, single source shortest path, weakly connected components, and PageRank. It is well known that the aforementioned algorithms can be expressed as an iteration of matrix–vector multiplications under different semi-rings. Based on this mathematical foundation, we show how to express the computation with standard relational queries and then we study how to efficiently evaluate them in parallel in a shared-nothing architecture. We identify a common algorithmic pattern that unifies the four graph algorithms, considering a common mathematical foundation based on sparse matrix–vector multiplication. The net gain is that our SQL-based approach enables solving “big data” graph problems on parallel database systems, debunking common wisdom that they are cumbersome and slow. Using large social networks and hyper-link real data sets, we present performance comparisons between a columnar DBMS, an open-source array DBMS, and Spark’s GraphX.
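The core idea of this paper, expressing a graph algorithm as an iterated sparse matrix-vector product under a semiring using relational queries, can be shown in miniature. The sketch below runs single-source shortest path under the (min, +) semiring; table and column names are ours, and the paper's actual targets are parallel columnar/array DBMSs rather than SQLite (the upsert syntax used here needs SQLite 3.24 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE E (i INTEGER, j INTEGER, w REAL)")   # sparse matrix (edge list)
cur.execute("CREATE TABLE V (i INTEGER PRIMARY KEY, d REAL)")  # distance vector
cur.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)])
cur.execute("INSERT INTO V VALUES (0, 0.0)")                   # source vertex 0

while True:
    # One (min, +) matrix-vector step: join the vector with the matrix,
    # aggregate with MIN, and keep a new distance only when it improves.
    cur.execute("""
        INSERT INTO V (i, d)
        SELECT E.j, MIN(V.d + E.w)
        FROM V JOIN E ON V.i = E.i
        WHERE TRUE                 -- avoids SQLite's upsert/join ambiguity
        GROUP BY E.j
        ON CONFLICT (i) DO UPDATE SET d = excluded.d
        WHERE excluded.d < V.d
    """)
    if cur.rowcount == 0:          # fixpoint reached: no distance improved
        break

print(cur.execute("SELECT i, d FROM V ORDER BY i").fetchall())
# -> [(0, 0.0), (1, 3.0), (2, 1.0), (3, 4.0)]
```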
Article
"Edge intelligence" refers to collecting, caching, analyzing, and processing data in close proximity to where it is collected, across a group of linked devices and systems. Edge intelligence aims to improve the quality and speed of data processing while also safeguarding the data's privacy and security. Despite its relative youth, this area of study, which dates only from 2011, has shown tremendous development in the last five years. This paper provides a survey of the architectures of edge intelligence, grouped as 1) Data Placement-Based Architectures to Reduce Latency; 2) Orchestration-Based ECAs-IoT; 3) Big Data Analysis-Based Architectures; and 4) Security-Based Architectures, as well as the challenges and solutions for innovative architectures in edge intelligence.
Article
Analysis of complex data sets to infer/discover meaningful information/knowledge involves (after data collection and cleaning): (i) modeling the data: an approach for deriving a suitable representation of the data for analysis; (ii) translating analysis objectives into computations on the generated model instance; these computations can be as simple as a query or as complex as community detection over multiple layers; (iii) computation of the expressions generated, considering efficiency and scalability; and (iv) drill-down of results to understand them clearly. Beyond this, it is also useful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this. This paper covers the above steps of the data analysis life cycle using a representation (or model) that is gaining importance. With complex data sets containing multiple entity types and relationships, an appropriate model to represent the data is important. For these data sets, we first establish the advantages of Multilayer Networks (or MLNs) as a data model. Then we use an entity-relationship-based approach to convert the data set into MLNs for a precise representation of the data set. After that, we outline how expected analysis objectives can be translated, using keyword mapping, to aggregate analysis expressions. Finally, we demonstrate, through a set of example data sets and objectives, how the expressions corresponding to objectives are evaluated using an efficient decoupling-based approach. Results are further drilled down to obtain actionable knowledge from the data set. Using the widely popular Enhanced Entity Relationship (EER) approach for requirements representation, we demonstrate how to generate EER diagrams for data sets and further generate, algorithmically, MLNs as well as relational schemas for analysis and drill-down, respectively. Using communities and centrality for aggregate analysis, we demonstrate the flexibility of the chosen model to support a diverse set of objectives. We also show that, compared to current analysis approaches, a "decoupling-based" approach using MLNs is more appropriate, as it preserves the structure as well as the semantics of the results and is very efficient. For this computation, we derive expressions for each analysis objective using the MLN model and provide guidelines to translate English queries into analysis expressions based on keywords. Finally, we use several data sets to establish the effectiveness of modeling using MLNs and their analysis using the recently proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are from US commercial airlines, IMDb (a large international movie data set), the familiar DBLP (or bibliography database), and the Covid-19 data set. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. Furthermore, we demonstrate the drill-down afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.
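A minimal sketch of the decoupling idea for a two-layer homogeneous MLN: analyze each layer independently, then compose the per-layer results instead of collapsing the layers first. The layer names and the intersection-based composition below are illustrative choices, not the paper's exact composition functions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two layers over the same node set (a homogeneous MLN).
layers = {
    "coauthor": nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5)]),
    "citation": nx.Graph([(1, 2), (4, 5), (5, 6), (6, 4)]),
}

# Step 1: per-layer analysis (here, modularity-based communities),
# computed on each layer independently; layers never get merged.
per_layer = {
    name: [frozenset(c) for c in greedy_modularity_communities(g)]
    for name, g in layers.items()
}

# Step 2: compose results across layers; pairwise intersection keeps
# groups of nodes that stay together in BOTH layers, so the structure
# and semantics of each layer's result are preserved.
combined = [
    a & b
    for a in per_layer["coauthor"]
    for b in per_layer["citation"]
    if len(a & b) > 1
]
print(per_layer)
print("communities preserved across layers:", combined)
```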
Chapter
Triangle enumeration is a fundamental problem in large-scale graph analysis. For instance, triangles are used to solve practical problems like community detection and spam filtering. On the other hand, there is a large amount of data stored on database management systems (DBMSs), which can be modeled and analyzed as graphs. Alternatively, graph data can be quickly loaded into a DBMS. Our paper shows how to adapt and optimize a randomized distributed triangle enumeration algorithm with SQL queries, which is a significantly different approach from programming graph algorithms in traditional languages such as Python or C++. We choose a parallel columnar DBMS given its fast query processing, but our solution should work for a row DBMS as well. Our randomized solution provides a balanced workload for parallel query processing, being robust to the existence of skewed degree vertices. We experimentally prove our solution ensures a balanced data distribution, and hence workload, among machines. The key idea behind the algorithm is to evenly partition all possible triplets of vertices among machines, sending edges that may form a triangle to a proxy machine; this edge redistribution eliminates shuffling edges during join computation and therefore triangle enumeration becomes local and fully parallel. In summary, our algorithm exhibits linear speedup with large graphs, including graphs that have high skewness in vertex degree distributions.
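The triplet-partitioning idea can be simulated in a single process. In this sketch (our reconstruction, not the authors' exact algorithm), vertices are hashed into p color groups, each "machine" owns one unordered color triple, edges are routed to every machine whose triple covers their endpoint colors, and a machine reports only triangles whose color multiset equals its own triple, so each triangle is counted exactly once.

```python
import itertools
from collections import defaultdict

p = 3                          # number of color groups (one machine per triple)
color = lambda v: v % p        # stand-in for a random vertex hash

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]

# Route each edge to every machine whose color triple (an unordered triple
# with repetition) contains the colors of both endpoints.
machines = defaultdict(set)
for u, v in edges:
    need = sorted((color(u), color(v)))
    for triple in itertools.combinations_with_replacement(range(p), 3):
        rest = list(triple)
        try:
            rest.remove(need[0])
            rest.remove(need[1])
        except ValueError:
            continue           # this triple does not cover the edge
        machines[triple].add((min(u, v), max(u, v)))

# Each machine enumerates triangles among its local edges. The dedup rule:
# keep a triangle only when the multiset of its vertex colors equals the
# machine's own triple, so no triangle is reported twice.
triangles = set()
for triple, local in machines.items():
    for (a, b), (c, d) in itertools.combinations(local, 2):
        corners = {a, b, c, d}
        if len(corners) == 3:
            x, y, z = sorted(corners)
            if {(x, y), (x, z), (y, z)} <= local and \
               tuple(sorted(map(color, (x, y, z)))) == triple:
                triangles.add((x, y, z))

print(sorted(triangles))       # -> [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
```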