Figure 3: Architecture of the PGQP System

Source publication
Article
Full-text available
Graphs, being expressive data structures, have become increasingly important for modeling real-world applications, such as collaborations, different kinds of transactions, and social networks, to name a few. With the advent of social networks and the web, graph sizes have grown too large to fit in main memory, precipitating the need for alternative...

Contexts in source publication

Context 1
... to the PGQP system is still a graph database and one or more queries. The QP-Subdue architecture has been extended (shown in Figure 3) to accept a partition with additional cut-set information (instead of the whole graph) and to process a query from a set of specified starting points. Our approach uses files for communication between iterations. ...
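The file-based hand-off between iterations can be pictured with a short sketch. The following is our reconstruction, not the system's actual format: record fields such as `resume_vertex` and `target_partition` are hypothetical, chosen only to illustrate what a continuation record would need to carry.

```python
import json

# A sketch of a continuation record written at the end of an iteration.
# The field names are hypothetical; they illustrate what has to be carried
# across partitions: the partial match so far, the cut edge's far
# endpoint, and the partition that owns it.
def write_continuations(path, records):
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_continuations(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

pending = [
    # A partial match in partition 0 stopped at cut edge (17 -> 42);
    # vertex 42 lives in partition 3, so the next iteration resumes there.
    {"query_id": "Q1", "matched_vertices": [5, 9, 17],
     "resume_vertex": 42, "target_partition": 3},
]
write_continuations("iter_001.cont", pending)
print(read_continuations("iter_001.cont"))
```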
Context 2
... At the end of an iteration (which corresponds to processing one or more partitions independently), appropriate information is written for each partition so that query processing can continue using other partitions. The shaded modules in Figure 3 are extensions for the partitioned approach. The non-shaded modules correspond to pre-existing systems/modules that are used, such as METIS/KaHIP, the catalog generator, and the plan generator. ...
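At a higher level, the loop this context describes might look like the following toy simulation. This is a sketch under simplifying assumptions: a fixed two-way partition and a stub `process_partition`; in the real system the partitioning comes from METIS/KaHIP, partitions are processed in parallel, and continuation records travel through files between iterations.

```python
# Toy stand-ins: two partitions; processing partition 0 hands one match
# over to partition 1, which then finishes it locally.
def process_partition(pid, work):
    if pid == 0 and work:
        return [{"query_id": "Q1", "target_partition": 1}]
    return []  # everything finished locally, nothing to hand over

pending = {0: ["Q1"], 1: []}  # queries start in partition 0
iteration = 0
while any(pending.values()):
    iteration += 1
    carried = {p: [] for p in pending}
    for pid, work in pending.items():
        # Independent calls: the real system runs these in parallel and
        # exchanges continuation records through files between iterations.
        for cont in process_partition(pid, work):
            carried[cont["target_partition"]].append(cont["query_id"])
    pending = carried
print("converged after", iteration, "iterations")  # 2 with this toy input
```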
Context 3
... The important aspects of implementing the PGQP system (whose architecture is shown in Figure 3) are briefly summarized in this section. For additional details, refer to [1]. ...

Similar publications

Article
Full-text available
Database outsourcing is a challenge concerning data secrecy. Even if an adversary, including the service provider, accesses the data, she should not be able to learn any information from the accessed data. In this paper, we address this problem for graph-structured data. First, we define a secrecy notion for graph-structured data based on the conce...

Citations

... Hence, these quality-aware fuzzy queries are put forward so that entering preferences is much more user-friendly and intuitive. Another study suggested resolutions to trade-offs in response time and scalability [7]. The proponents used a divide-and-conquer approach based on graph partitions and identified quantitative metrics for formulating heuristics that regulate the partition processing sequence for query-processing efficiency. ...
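As an illustration of the kind of heuristic this citation alludes to, a partition-ordering score might combine cheap per-partition metrics. The metrics and weighting below are our assumptions, not the formula from [7].

```python
# A hypothetical partition-ordering heuristic: prefer partitions with many
# query starting points and few cut edges, so more of the query resolves
# without crossing into other partitions.
def partition_order(partitions):
    """Return partition ids, most promising first."""
    def score(p):
        return p["start_vertices"] - 0.5 * p["cut_edges"]
    return [p["id"] for p in sorted(partitions, key=score, reverse=True)]

parts = [
    {"id": 0, "start_vertices": 12, "cut_edges": 40},
    {"id": 1, "start_vertices": 3,  "cut_edges": 5},
    {"id": 2, "start_vertices": 9,  "cut_edges": 8},
]
print(partition_order(parts))  # -> [2, 1, 0]
```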
... However, with the emergence of data sets with relationships among entities and complex application requirements, such as shortest paths, important neighborhoods, dominant nodes (or groups of nodes), etc. [54, 43], the relational data model was not the best choice for modeling or analyzing them [38]. This led to the evolution of NoSQL data models, including the graph data model [29]. ...
... Advantages. Attribute graphs have been successfully used in subgraph mining [41, 58, 53], querying [54, 43, 42], and searching [52] over multi-entity, multi-feature data sets. They capture more semantic information than simple graphs and can handle multiple types of both features and entities. ...
Preprint
Full-text available
Any large complex data analysis to infer or discover meaningful information/knowledge involves the following steps (in addition to data collection, cleaning, and preparing the data for analysis, such as attribute elimination): i) modeling the data: an approach for modeling and deriving a data representation for analysis using that approach; ii) translating analysis objectives into computations on the model generated; this can be as simple as a single computation (e.g., community detection) or may involve a sequence of operations (e.g., pair-wise community detection over multiple networks) using expressions based on the model; iii) computation of the expressions generated, where efficiency and scalability come into the picture; and iv) drill-down of results to interpret or understand them clearly. Beyond this, it is also meaningful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this. This paper covers all of the above steps of the data analysis life cycle using a data representation that is gaining importance for multi-entity, multi-feature data sets: Multilayer Networks (MLNs). We use several data sets to establish the effectiveness of modeling using MLNs and analyze them using the proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are US commercial airlines, IMDb, DBLP, and a Covid-19 data set. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. We demonstrate the drill-down that is afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.
... However, with the emergence of structured data sets with inherent relationships among entities and complex application requirements, such as shortest paths, important neighborhoods, dominant nodes (or groups of nodes), etc. [7, 12], the relational data model was not the best choice for modeling or analyzing them [5]. This led to the evolution of NoSQL data models, including the graph data model [3]. ...
Chapter
Extended Entity Relationship (or EER) modeling is an important step after application requirements for data analysis are gathered, and is critical for translating user requirements to a given executable data model (e.g., relational or, for this paper, Multilayer Networks or MLNs). EER modeling provides a more precise understanding of the application and data requirements and an unambiguous representation from which the data model (on which analysis is performed) can be generated algorithmically. EER has played a central role in modeling user-level requirements for relational, object-oriented, and other data models. UML, whose roots are in EER modeling, is extensively used in industry.
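A toy sketch of the algorithmic generation step this chapter describes, under heavily simplified assumptions: each entity type becomes a layer and each relationship becomes intra- or inter-layer edges. The dictionary encoding of the EER input is entirely illustrative, not the chapter's actual representation.

```python
# Hypothetical simplified EER description: entity types plus binary
# relationships between them.
eer = {
    "entities": ["Author", "Paper"],
    "relationships": [
        {"name": "coauthors", "between": ("Author", "Author")},  # intra-layer
        {"name": "writes",    "between": ("Author", "Paper")},   # inter-layer
    ],
}

# Derive an MLN skeleton: one layer per entity type; relationships within a
# type become intra-layer edge types, across types inter-layer edge types.
mln = {"layers": list(eer["entities"]), "intra": [], "inter": []}
for rel in eer["relationships"]:
    a, b = rel["between"]
    kind = "intra" if a == b else "inter"
    mln[kind].append({"layer_pair": (a, b), "edge_type": rel["name"]})
print(mln)
```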
... • The second article [2], whose title is "Query Processing on Large Graphs: Approaches to Scalability and Response Time Trade Offs", studies parallel processing of queries in a distributed system. This paper proposes to partition a graph into subgraphs for parallel processing instead of redistributing edges by vertex. ...
Article
Full-text available
Graph problems are significantly harder to solve with large graphs residing on disk compared to main memory only. In this work, we study how to solve four important graph problems: reachability from a source vertex, single source shortest path, weakly connected components, and PageRank. It is well known that the aforementioned algorithms can be expressed as an iteration of matrix–vector multiplications under different semi-rings. Based on this mathematical foundation, we show how to express the computation with standard relational queries and then we study how to efficiently evaluate them in parallel in a shared-nothing architecture. We identify a common algorithmic pattern that unifies the four graph algorithms, considering a common mathematical foundation based on sparse matrix–vector multiplication. The net gain is that our SQL-based approach enables solving “big data” graph problems on parallel database systems, debunking common wisdom that they are cumbersome and slow. Using large social networks and hyper-link real data sets, we present performance comparisons between a columnar DBMS, an open-source array DBMS, and Spark’s GraphX.
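The core idea of this paper, expressing a graph algorithm as an iterated sparse matrix-vector product under a semiring using relational queries, can be shown in miniature. The sketch below runs single-source shortest path under the (min, +) semiring; table and column names are ours, and the paper's actual targets are parallel columnar/array DBMSs rather than SQLite (the upsert syntax used here needs SQLite 3.24 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE E (i INTEGER, j INTEGER, w REAL)")   # sparse matrix (edge list)
cur.execute("CREATE TABLE V (i INTEGER PRIMARY KEY, d REAL)")  # distance vector
cur.executemany("INSERT INTO E VALUES (?, ?, ?)",
                [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)])
cur.execute("INSERT INTO V VALUES (0, 0.0)")                   # source vertex 0

while True:
    # One (min, +) matrix-vector step: join the vector with the matrix,
    # aggregate with MIN, and keep a new distance only when it improves.
    cur.execute("""
        INSERT INTO V (i, d)
        SELECT E.j, MIN(V.d + E.w)
        FROM V JOIN E ON V.i = E.i
        WHERE TRUE                 -- avoids SQLite's upsert/join ambiguity
        GROUP BY E.j
        ON CONFLICT (i) DO UPDATE SET d = excluded.d
        WHERE excluded.d < V.d
    """)
    if cur.rowcount == 0:          # fixpoint reached: no distance improved
        break

print(cur.execute("SELECT i, d FROM V ORDER BY i").fetchall())
# -> [(0, 0.0), (1, 3.0), (2, 1.0), (3, 4.0)]
```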
Article
"Edge intelligence" refers to collecting, caching, analyzing, and processing data in close proximity to where it is collected, across a group of linked devices and systems. Edge intelligence aims to improve the quality and speed of data processing while also safeguarding the data's privacy and security. Despite its relative youth, this area of study, which dates only from 2011, has shown tremendous development in the last five years. This paper provides a survey of the architectures of edge intelligence, grouped as 1) Data Placement-Based Architectures to Reduce Latency; 2) Orchestration-Based ECAs-IoT; 3) Big Data Analysis-Based Architectures; and 4) Security-Based Architectures, as well as the challenges and solutions for innovative architectures in edge intelligence.
Article
Analysis of complex data sets to infer/discover meaningful information/knowledge involves (after data collection and cleaning): (i) modeling the data: an approach for deriving a suitable representation of the data for analysis; (ii) translating analysis objectives into computations on the generated model instance; these computations can be as simple as a query or as complex as community detection over multiple layers; (iii) computation of the expressions generated, considering efficiency and scalability; and (iv) drill-down of results to understand them clearly. Beyond this, it is also useful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this. This paper covers the above steps of the data analysis life cycle using a representation (or model) that is gaining importance. With complex data sets containing multiple entity types and relationships, an appropriate model to represent the data is important. For these data sets, we first establish the advantages of Multilayer Networks (or MLNs) as a data model. Then we use an entity-relationship-based approach to convert the data set into MLNs for a precise representation of the data set. After that, we outline how expected analysis objectives can be translated, using keyword mapping, to aggregate analysis expressions. Finally, we demonstrate, through a set of example data sets and objectives, how the expressions corresponding to objectives are evaluated using an efficient decoupling-based approach. Results are further drilled down to obtain actionable knowledge from the data set. Using the widely popular Enhanced Entity Relationship (EER) approach for requirements representation, we demonstrate how to generate EER diagrams for data sets and further generate, algorithmically, MLNs as well as relational schemas for analysis and drill-down, respectively. Using communities and centrality for aggregate analysis, we demonstrate the flexibility of the chosen model to support a diverse set of objectives. We also show that, compared to current analysis approaches, a "decoupling-based" approach using MLNs is more appropriate, as it preserves the structure as well as the semantics of the results and is very efficient. For this computation, we derive expressions for each analysis objective using the MLN model and provide guidelines to translate English queries into analysis expressions based on keywords. Finally, we use several data sets to establish the effectiveness of modeling using MLNs and their analysis using the recently proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are from US commercial airlines, IMDb (a large international movie data set), the familiar DBLP (or bibliography database), and the Covid-19 data set. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. Furthermore, we demonstrate the drill-down afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.
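A minimal sketch of the decoupling idea for a two-layer homogeneous MLN: analyze each layer independently, then compose the per-layer results instead of collapsing the layers first. The layer names and the intersection-based composition below are illustrative choices, not the paper's exact composition functions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two layers over the same node set (a homogeneous MLN).
layers = {
    "coauthor": nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5)]),
    "citation": nx.Graph([(1, 2), (4, 5), (5, 6), (6, 4)]),
}

# Step 1: per-layer analysis (here, modularity-based communities),
# computed on each layer independently; layers never get merged.
per_layer = {
    name: [frozenset(c) for c in greedy_modularity_communities(g)]
    for name, g in layers.items()
}

# Step 2: compose results across layers; pairwise intersection keeps
# groups of nodes that stay together in BOTH layers, so the structure
# and semantics of each layer's result are preserved.
combined = [
    a & b
    for a in per_layer["coauthor"]
    for b in per_layer["citation"]
    if len(a & b) > 1
]
print(per_layer)
print("communities preserved across layers:", combined)
```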
Chapter
Triangle enumeration is a fundamental problem in large-scale graph analysis. For instance, triangles are used to solve practical problems like community detection and spam filtering. On the other hand, there is a large amount of data stored on database management systems (DBMSs), which can be modeled and analyzed as graphs. Alternatively, graph data can be quickly loaded into a DBMS. Our paper shows how to adapt and optimize a randomized distributed triangle enumeration algorithm with SQL queries, which is a significantly different approach from programming graph algorithms in traditional languages such as Python or C++. We choose a parallel columnar DBMS given its fast query processing, but our solution should work for a row DBMS as well. Our randomized solution provides a balanced workload for parallel query processing, being robust to the existence of skewed degree vertices. We experimentally prove our solution ensures a balanced data distribution, and hence workload, among machines. The key idea behind the algorithm is to evenly partition all possible triplets of vertices among machines, sending edges that may form a triangle to a proxy machine; this edge redistribution eliminates shuffling edges during join computation and therefore triangle enumeration becomes local and fully parallel. In summary, our algorithm exhibits linear speedup with large graphs, including graphs that have high skewness in vertex degree distributions.
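The triplet-partitioning idea can be simulated in a single process. In this sketch (our reconstruction, not the authors' exact algorithm), vertices are hashed into p color groups, each "machine" owns one unordered color triple, edges are routed to every machine whose triple covers their endpoint colors, and a machine reports only triangles whose color multiset equals its own triple, so each triangle is counted exactly once.

```python
import itertools
from collections import defaultdict

p = 3                          # number of color groups (one machine per triple)
color = lambda v: v % p        # stand-in for a random vertex hash

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]

# Route each edge to every machine whose color triple (an unordered triple
# with repetition) contains the colors of both endpoints.
machines = defaultdict(set)
for u, v in edges:
    need = sorted((color(u), color(v)))
    for triple in itertools.combinations_with_replacement(range(p), 3):
        rest = list(triple)
        try:
            rest.remove(need[0])
            rest.remove(need[1])
        except ValueError:
            continue           # this triple does not cover the edge
        machines[triple].add((min(u, v), max(u, v)))

# Each machine enumerates triangles among its local edges. The dedup rule:
# keep a triangle only when the multiset of its vertex colors equals the
# machine's own triple, so no triangle is reported twice.
triangles = set()
for triple, local in machines.items():
    for (a, b), (c, d) in itertools.combinations(local, 2):
        corners = {a, b, c, d}
        if len(corners) == 3:
            x, y, z = sorted(corners)
            if {(x, y), (x, z), (y, z)} <= local and \
               tuple(sorted(map(color, (x, y, z)))) == triple:
                triangles.add((x, y, z))

print(sorted(triangles))       # -> [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
```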