Figure 5: Example Vertical Fragment
Source publication
Article
Full-text available
As the volume of RDF data becomes increasingly large, it is essential to design distributed database systems to manage it. For distributed RDF data design, it is common to partition the RDF data into parts, called fragments, which are then distributed. Thus, the distribution design consists of two steps: fragmentation and allocation.

Context in source publication

Context 1
... Given the frequent access pattern p_3 in Figure 4, Figure 5 shows the corresponding vertical fragment. ...
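To make the figure's notion concrete, here is a minimal Python sketch of vertical fragmentation, assuming a pattern is encoded simply as a set of predicates (a stand-in for p_3): triples whose predicates belong to the pattern are grouped into one fragment. The data and encoding are illustrative assumptions, not the paper's exact definitions.

```python
# Hypothetical sketch: a vertical fragment collects the triples relevant
# to one frequent access pattern, here represented as a predicate set.

def vertical_fragment(triples, pattern_predicates):
    """Return the triples matching one frequent access pattern."""
    return [(s, p, o) for (s, p, o) in triples if p in pattern_predicates]

triples = [
    ("alice", "follows", "bob"),
    ("alice", "likes", "post1"),
    ("bob",   "follows", "carol"),
]
# Suppose the mined pattern touches only the "follows" predicate.
p3 = {"follows"}
print(vertical_fragment(triples, p3))
# [('alice', 'follows', 'bob'), ('bob', 'follows', 'carol')]
```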

Citations

... In contrast, we mine complex subquery patterns from historical workloads and only generate the triple partitions whose predicates are contained in the pattern. [52] also explores the intrinsic similarities among the structures of queries in a workload for graph partitioning. While [52] divides a graph into many partitions and designs a cost-aware algorithm to distribute the partitions among sites, we propose a Q-learning-based physical design tuner to choose the triple partitions that need to be migrated from the relational store to the graph store. Therefore, both our problems and approaches are different. ...
Article
To effectively manage the growing knowledge graphs in various domains, knowledge graph storage management has emerged as a hot research topic. Existing methods are classified into relational stores and native graph stores. Relational stores can store large-scale knowledge graphs and are convenient for updating knowledge, but their query performance degrades noticeably when the selectivity of a knowledge graph query is large. Native graph stores are efficient at processing complex knowledge graph queries due to their index-free adjacency property, but they are ill-suited to managing a large-scale knowledge graph because of limited storage budgets or an inflexible updating process. Motivated by this, we propose a dual-store structure that leverages a graph store to accelerate complex query processing in the relational store. However, it is challenging to determine what data to transfer from the relational store to the graph store, and when. To address this problem, we formulate it as a Markov Decision Process and derive a physical design tuner, DOTIL, based on reinforcement learning. With DOTIL, the dual-store structure adapts to dynamically changing workloads. Experimental results on real knowledge graphs demonstrate that the proposed dual-store structure improves query performance by up to 50.11 percent on average compared with the most commonly used relational stores.
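The abstract formulates the migration decision as a Markov Decision Process solved with reinforcement learning. Below is a minimal tabular Q-learning sketch in that spirit; the state encoding (the set of partitions already migrated to the graph store), the action space, and the reward function are illustrative assumptions, not DOTIL's actual design.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch of a dual-store tuner.
# State: frozenset of partitions already migrated to the graph store.
# Action: migrate one more partition. Reward: stand-in for the measured
# latency gain; a real tuner would observe the running workload.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2
partitions = ["P_follows", "P_likes", "P_knows"]   # hypothetical triple partitions
Q = defaultdict(float)                              # Q[(state, action)]

def reward(action):
    return 1.0 if action == "P_follows" else -0.1   # pretend only P_follows helps

def step(state):
    actions = [p for p in partitions if p not in state] or [None]
    if random.random() < EPSILON:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    nxt = frozenset(state | {action}) if action else state
    best_next = max((Q[(nxt, a)] for a in partitions if a not in nxt), default=0.0)
    Q[(state, action)] += ALPHA * (reward(action) + GAMMA * best_next - Q[(state, action)])
    return nxt

for _ in range(300):                                # short training episodes
    state = frozenset()
    for _ in partitions:
        state = step(state)

print(max(partitions, key=lambda a: Q[(frozenset(), a)]))
# learned first migration: P_follows
```

The sketch omits reverse actions (moving data back to the relational store), which a workload-adaptive tuner would also need.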
... The gain of reassigning a vertex from its source to a target partition is how many more neighbors it has in the target partition than in the source partition. Peng et al. [36,37] propose a workload-driven partitioning method that mines frequent query patterns from a representative query workload and then puts matches of the same frequent pattern into the same fragment to improve workload throughput. ...
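The reassignment gain defined in this snippet translates directly into code; the following sketch computes it (all names are illustrative).

```python
# Gain of moving `vertex` from `source` to `target`: how many more
# neighbors it has in the target partition than in the source partition.

def reassignment_gain(vertex, neighbors, partition_of, source, target):
    in_target = sum(1 for n in neighbors[vertex] if partition_of[n] == target)
    in_source = sum(1 for n in neighbors[vertex] if partition_of[n] == source)
    return in_target - in_source

neighbors = {"v": ["a", "b", "c"]}
partition_of = {"a": 0, "b": 1, "c": 1}
print(reassignment_gain("v", neighbors, partition_of, source=0, target=1))
# 2 - 1 = 1, so moving v to partition 1 gains one local neighbor
```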
... Competitors. Regarding Table 1, we compare WASP against Hermes [33], which is the only strategy with no prior knowledge of query workloads. We also compare against the Peng et al. method [36] as a representative of strategies based on a priori knowledge of the query workload. It has already been shown to achieve faster query response times than WARP [21] and Partout [17]. ...
... However, since Hermes does not consider the weight of active edges, the corresponding decreasing rate is lower than WASP's. On the other hand, unlike WASP and Hermes, which initially partition the graph dataset via a simple hashing strategy, the Peng et al. method [36] partitions the whole graph dataset assuming a priori knowledge of the WatDiv-SW workload. Therefore, the diagram shows an almost steady IPT ratio of less than 0.1, as the Peng et al. method has already assigned the matches of each frequent pattern to the same partition. ...
Article
Full-text available
Streaming graph partitioning methods have recently gained attention due to their ability to scale to very large graphs with limited resources. However, many such methods do not consider workload and graph characteristics. This may degrade the performance of queries by increasing inter-node communication and computational load imbalance. Moreover, existing workload-aware methods cannot consistently provide good performance because they do not consider the dynamic workloads that keep emerging in graph applications. We address these issues by proposing a novel workload-adaptive streaming partitioner named WASP, which aims to achieve low-latency and high-throughput online graph queries. As each workload typically contains frequent query patterns, WASP exploits the existing workload to capture active vertices and edges, which are frequently visited and traversed, respectively. This information is used to heuristically improve the quality of partitions, either by avoiding the concentration of active vertices in a few partitions proportional to their visit frequencies or by reducing the probability of cutting active edges proportional to their traversal frequencies. To assess the impact of WASP on a graph store and to show how easily the approach can be plugged on top of a system, we exploit it in a distributed graph-based RDF store. Our experiments over three synthetic and real-world graph datasets and the corresponding static and dynamic query workloads show that WASP achieves better query performance than state-of-the-art graph partitioners, especially on dynamic query workloads.
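As a rough illustration of this kind of heuristic, the sketch below scores candidate partitions for a streaming vertex by traversal-weighted neighbor affinity, with a load penalty weighted by the vertex's visit frequency. The exact scoring function in WASP differs; everything here is an assumption for illustration only.

```python
# Workload-aware greedy placement sketch: prefer the partition holding
# the most heavily traversed neighbors, discounted by partition load,
# and penalise loading busy partitions with frequently visited vertices.

def place(vertex, edges, visit_freq, traversal_freq, parts, capacity):
    def score(p):
        affinity = sum(traversal_freq.get((vertex, n), 1.0)
                       for n in edges.get(vertex, []) if n in parts[p])
        balance = 1.0 - len(parts[p]) / capacity
        load_penalty = visit_freq.get(vertex, 0.0) * len(parts[p]) / capacity
        return affinity * balance - load_penalty
    best = max(parts, key=score)
    parts[best].add(vertex)
    return best

parts = {0: set(), 1: set()}
edges = {"b": ["a"], "c": ["a"]}
place("a", edges, {}, {}, parts, capacity=4)
print(place("b", edges, {"b": 0.5}, {("b", "a"): 3.0}, parts, capacity=4))
# 0: b joins a's partition because their shared edge is traversed often
```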
... Graphs have recently been adopted to represent complicated structures, and graph databases have been widely applied in many domains, such as chemistry [1,2], images [3,4], knowledge graphs [5,6], social networks [7,8], and XML documents [9,10]. For example, in a social network, individuals (or organizations) and their pairwise relationships are modeled as vertices and edges, respectively. ...
Article
Full-text available
Graphs and graph databases are widely used in many domains, and graph querying is attracting more and more attention. Among these querying problems, subgraph querying is the most compelling one, since it involves the very expensive subgraph isomorphism test. This paper proposes a novel subgraph querying method, PLGCoding, which uses information about shortest paths and Laplacian spectra to filter out false positives. Specifically, we first extract features, including information about vertices, edges, shortest paths, and Laplacian spectra, and encode the extracted features. An index, the PLGCode-Tree, is built over these codes to shrink the candidate set. We then propose a two-step filtering strategy to implement the filtering-and-verification framework and thus generate the answer set. Experimental results comparing PLGCoding with competing methods on a real dataset show that it improves querying efficiency.
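To illustrate the filter-and-verify idea, the sketch below encodes simple per-graph features (vertex/edge counts, degree sequence, Laplacian spectrum) and prunes candidates with count and degree conditions, which are genuinely necessary for a subgraph embedding. How PLGCoding actually combines path and spectral features is more elaborate; this is only the general shape of such a filter.

```python
import numpy as np

# Per-graph feature encoding for a filtering-and-verification pipeline.
# Only the count/degree checks below are used for pruning here; the
# Laplacian spectrum is computed as one more storable feature.

def features(adj):
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj                      # graph Laplacian L = D - A
    return {
        "n": adj.shape[0],
        "m": int(deg.sum() // 2),
        "degs": np.sort(deg)[::-1],               # descending degree sequence
        "spectrum": np.sort(np.linalg.eigvalsh(lap)),
    }

def may_contain(big, small):
    """Necessary conditions: if they fail, `small` cannot embed in `big`."""
    if small["n"] > big["n"] or small["m"] > big["m"]:
        return False
    k = small["n"]                                # degree-sequence dominance
    return bool(np.all(small["degs"] <= big["degs"][:k]))

G = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)   # triangle
g = np.array([[0, 1], [1, 0]], float)                    # single edge
print(may_contain(features(G), features(g)))   # True: proceed to verification
```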
... Send B_v to S_c; Receive B_v from S_c; ...
... There have been many works on distributed SPARQL query processing, and a very good survey is [12]. Recently, some approaches such as [5], [23], [22], [9], [2], [16], [8], [20], [10] have been proposed. ...
... Second, some approaches [23], [22], [9], [2], [16] are partition-based. They divide an RDF graph into several partitions. ...
Preprint
Full-text available
Partial evaluation has recently been used for processing SPARQL queries over a large resource description framework (RDF) graph in a distributed environment. However, the previous approach is inefficient when dealing with complex queries. In this study, we further improve the "partial evaluation and assembly" framework for answering SPARQL queries over a distributed RDF graph, while providing performance guarantees. Our key idea is to explore the intrinsic structural characteristics of partial matches to filter out irrelevant partial results, while providing performance guarantees on a network trace (data shipment) or the computational cost (response time). We also propose an efficient assembly algorithm to utilize the characteristics of partial matches to merge them and form final results. To improve the efficiency of finding partial matches further, we propose an optimization that communicates variables' candidates among sites to avoid redundant computations. In addition, although our approach is partitioning-tolerant, different partitioning strategies result in different performances, and we evaluate different partitioning strategies for our approach. Experiments over both real and synthetic RDF datasets confirm the superiority of our approach.
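The "assembly" step described in this abstract can be illustrated as a join of partial matches on their shared query variables. The sketch below shows only that core merge; the paper's algorithm adds structure-based pruning of irrelevant partial results and performance guarantees on top of it. The variable and site names are illustrative.

```python
from itertools import product

# Assemble partial matches (dicts from query variables to data vertices)
# produced at different sites: two matches merge when they agree on
# every variable they share.

def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def assemble(site_a, site_b):
    return [{**m1, **m2}
            for m1, m2 in product(site_a, site_b)
            if compatible(m1, m2)]

site_a = [{"?x": "alice", "?y": "bob"}]
site_b = [{"?y": "bob", "?z": "post1"}, {"?y": "carol", "?z": "post2"}]
print(assemble(site_a, site_b))
# [{'?x': 'alice', '?y': 'bob', '?z': 'post1'}]
```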
... Graph-based partitioning is an NP-complete problem [14], and hence hash partitioning heuristics [21,31] are employed instead of graph-based partitioning in order to partition RDF data efficiently. However, sophisticated partitioning techniques [11,15,22,28] cannot guarantee that no data will be shuffled when processing complex queries with multiple joins. Several techniques [23,29] utilize the query workload to enhance the partitioning of RDF data. ...
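A minimal version of the hash partitioning heuristics mentioned here assigns each triple to a site by hashing its subject, so all triples about one subject co-locate. The encoding is illustrative; it also shows why multi-join queries may still shuffle data across sites.

```python
import hashlib

# Subject-hash partitioning: cheap and balanced, but joins that chain
# through objects (e.g. ?x p ?y . ?y q ?z) can still cross site boundaries.

def site_of(subject, num_sites):
    digest = hashlib.md5(subject.encode()).hexdigest()
    return int(digest, 16) % num_sites

triples = [("alice", "follows", "bob"), ("alice", "likes", "post1"),
           ("bob", "follows", "carol")]
parts = {}
for s, p, o in triples:
    parts.setdefault(site_of(s, 4), []).append((s, p, o))
print(parts)  # alice's triples land on one site, bob's possibly on another
```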
... SPARQL queries containing multiple triple patterns are resolved by using merge and index joins. Peng et al. (2016) proposed a method to distribute and allocate the RDF partitions by exploring the intrinsic similarities among the structures of queries in the executed workload to reduce the number of crossing matches and the communication cost during query processing. In particular, the proposed approach mines and selects some of the frequent access patterns that reflect the characteristics of the workload. ...
Article
Full-text available
The Resource Description Framework (RDF) represents a main ingredient and data representation format for Linked Data and the Semantic Web. It supports a generic graph-based data model and data representation format for describing things, including their relationships with other things. As the size of RDF datasets is growing fast, RDF data management systems must be able to cope with growing amounts of data. Even though physically handling RDF data using a relational table is possible, querying a giant triple table becomes very expensive because of the multiple nested joins required for answering graph queries. In addition, the heterogeneity of RDF data poses entirely new challenges to database systems. This article provides a comprehensive study of the state of the art in handling and querying RDF data. In particular, we focus on data storage techniques, indexing strategies, and query execution mechanisms. Moreover, we provide a classification of existing systems and approaches. We also provide an overview of the various benchmarking efforts in this context and discuss some of the open problems in this domain.
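The join cost this abstract mentions is easy to see in miniature: even a two-hop graph pattern over a triple table becomes a self-join of the table with itself, and each additional triple pattern in the query adds another join. A small Python sketch with assumed data:

```python
# Why a giant triple table gets expensive: the pattern
# (?x follows ?y . ?y follows ?z) is a self-join of the triples on ?y.

triples = [("alice", "follows", "bob"),
           ("bob", "follows", "carol"),
           ("carol", "likes", "post1")]

follows = [(s, o) for (s, p, o) in triples if p == "follows"]
two_hop = [(x, y, z)
           for (x, y) in follows
           for (y2, z) in follows
           if y == y2]                 # one join per extra triple pattern
print(two_hop)                         # [('alice', 'bob', 'carol')]
```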
... Workload sensitive partitioners [4,22,23,25,27,31] attempt to optimise the placement of data to suit a particular workload. Such systems may be streaming or non-streaming, but are discussed separately here because they pertain most closely to the work we do with Loom. ...
... In the domain of RDF stores, Peng et al. [23] use frequent subgraph mining ahead of time to select a set of patterns common to a provided SPARQL query workload. They then propose partitioning strategies which ensure that any data matching one of these frequent patterns is allocated wholly within a single partition, thus reducing average query response time at the cost of having to replicate (potentially many) sub-graphs which form part of multiple frequent patterns. ...
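The allocation idea in this snippet, placing every match of a frequent pattern wholly in one partition and replicating subgraphs shared by several matches, can be sketched as follows. The round-robin assignment is a stand-in for the cost-aware allocation; the data is illustrative.

```python
from itertools import cycle

# Each match (a set of triples) is placed entirely in one partition, so a
# triple appearing in matches assigned to different partitions is replicated.

def allocate(pattern_matches, num_partitions):
    partitions = [set() for _ in range(num_partitions)]
    rr = cycle(range(num_partitions))
    for match in pattern_matches:
        partitions[next(rr)] |= match       # whole match in one partition
    return partitions

t1, t2, t3 = ("a", "p", "b"), ("b", "p", "c"), ("a", "q", "c")
parts = allocate([{t1, t2}, {t2, t3}], 2)   # t2 occurs in both matches
print([sorted(p) for p in parts])           # t2 is replicated across partitions
```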
Article
As with general graph processing systems, partitioning data over a cluster of machines improves the scalability of graph database management systems. However, these systems incur additional network cost during the execution of a query workload due to inter-partition traversals. Workload-agnostic partitioning algorithms typically minimise the likelihood of any edge crossing partition boundaries. However, these partitioners are sub-optimal with respect to many workloads, especially queries that may require more frequent traversal of specific subsets of inter-partition edges. Furthermore, they are largely unsuited to operating incrementally on dynamic, growing graphs. We present a new graph partitioning algorithm, Loom, that operates on a stream of graph updates and continuously allocates new vertices and edges to partitions, taking into account a query workload of graph pattern expressions along with their relative frequencies. First, we capture the most common patterns of edge traversals which occur when executing queries. We then compare sub-graphs, which present themselves incrementally in the graph update stream, against these common patterns. Finally, we attempt to allocate each match to a single partition, reducing the number of inter-partition edges within frequently traversed sub-graphs and improving average query performance. Loom is extensively evaluated over several large test graphs with realistic query workloads and various orderings of the graph updates. We demonstrate that, given a workload, our prototype produces partitionings of significantly better quality than the existing streaming graph partitioning algorithms Fennel and LDG.
Chapter
The Resource Description Framework (RDF) is widely used to model web data. The scale and complexity of the modeled data pose performance challenges for RDF triple stores. Workload adaptation is one important strategy for dealing with those challenges at the storage level. In all current adaptation approaches, the workload statistics are built collectively, and the analysis process is not aware of old or recent items in the workloads. However, that does not reflect the timely trends that exist naturally in user queries and causes the analysis process to lag behind rapid workload development. In this work, we model the workload statistics as time series and apply well-known smoothing techniques, allowing the importance of the workload to decay over time. We apply the proposed approach to UniAdapt [1], which follows a unified and comprehensive storage adaptation process.

Keywords: RDF, Triple stores, Workload adaptation
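A minimal sketch of the decaying workload statistics described here: per-pattern query counters are exponentially smoothed per time window, so recent queries outweigh old ones. The smoothing factor and the window encoding are illustrative assumptions, not the chapter's exact model.

```python
# Exponential smoothing of workload statistics: after each window,
# new = ALPHA * count_in_window + (1 - ALPHA) * old, so a pattern that
# stops appearing decays toward zero instead of dominating forever.

ALPHA = 0.5  # weight of the newest window (illustrative choice)

def smooth(history, windows):
    """history: pattern -> smoothed count; windows: list of {pattern: count}."""
    for window in windows:
        for pattern in set(history) | set(window):
            history[pattern] = (ALPHA * window.get(pattern, 0)
                                + (1 - ALPHA) * history.get(pattern, 0.0))
    return history

stats = smooth({}, [{"q1": 10}, {"q1": 0, "q2": 8}, {"q2": 8}])
print(stats)  # q1 decays once it stops appearing; q2 stays high
```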