Conference Paper

A Combined Semi-pipelined Query Processing Architecture for Distributed Full-Text Retrieval

Authors: Simon Jonassen and Svein Erik Bratsberg

Abstract

Term-partitioning is an efficient way to distribute a large inverted index. Two fundamentally different query processing approaches are pipelined and non-pipelined execution. While the pipelined approach provides higher query throughput, the non-pipelined approach provides shorter query latency. In this work we propose a third alternative, which combines non-pipelined inverted index access, a heuristic choice between pipelined and non-pipelined query execution, and an improved query routing strategy. Our results show that the method combines the advantages of both approaches and provides both high throughput and short query latency: it increases throughput by up to 26% compared to the non-pipelined approach and reduces latency by up to 32% compared to the pipelined approach.
A Combined Semi-Pipelined Query Processing Architecture for Distributed Full-Text Retrieval
Simon Jonassen and Svein Erik Bratsberg
Department of Computer and Information Science
Norwegian University of Science and Technology
The 11th International Conference on Web Information Systems Engineering (WISE 2010)
Hong Kong, China, 12-14 December 2010
Outline
Introduction to distributed inverted indexes
Problem definition and motivation
Our approach
Experimental evaluation
Conclusions
Inverted index approach to IR
(figure illustrating the inverted index approach; illustration © apple.com)
Document-wise partitioning
Each node indexes a subset of the documents.
A query q is broadcast to all of the nodes and executed concurrently; one of the nodes has to combine the results.
Main advantages: simple and fast.
Main problems: q disk seeks on each node, all of the nodes are involved in processing of each query, and new nodes increase the overhead.
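To make the broadcast-and-merge pattern above concrete, here is a minimal Python sketch; the DocPartitionedNode class, the toy term-frequency scoring and the broker_search function are illustrative assumptions, not part of the original system:

```python
import heapq
from collections import defaultdict

class DocPartitionedNode:
    """Holds a local inverted index over a subset of the documents."""
    def __init__(self, docs):
        # docs: {doc_id: text}; build a toy term -> {doc_id: term frequency} index
        self.index = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def execute(self, query, k):
        """Score the local documents and return the local top-k."""
        scores = defaultdict(float)
        for term in query:
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf   # toy scoring; a real engine would use e.g. BM25
        return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

def broker_search(nodes, query, k=10):
    """Broadcast the query to every node and merge the partial top-k lists."""
    partial = []
    for node in nodes:                 # in a real system these calls run concurrently
        partial.extend(node.execute(query, k))
    return heapq.nlargest(k, partial, key=lambda item: item[1])

nodes = [DocPartitionedNode({1: "sugar tariff quota"}),
         DocPartitionedNode({2: "interest rate", 3: "sugar rate"})]
print(broker_search(nodes, ["sugar", "rate"]))
```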
Term-wise partitioning
Each node stores a subset of a global index.
Each query is divided into a number of sub-queries. Each node fetches its data and sends it to another node, which receives and processes all of the posting lists.
Main advantages: q disk seeks in total, fewer network messages, and up to q nodes involved per query; with n >> q several queries can be executed concurrently, which gives high throughput and fault-tolerance.
Main problems: all processing is done by one node (the other nodes act as advanced network disks), high network load, and load balancing is critical.
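A corresponding sketch of non-pipelined processing over a term-wise partitioned index; the TermPartitionedNode class and the toy scoring are again assumptions made for illustration. Every node holding a query term ships its posting lists to one node, which builds all accumulators and extracts the top results:

```python
import heapq
from collections import defaultdict

class TermPartitionedNode:
    """Holds the full posting lists for a subset of the vocabulary."""
    def __init__(self, postings):
        # postings: {term: [(doc_id, tf), ...]}
        self.postings = postings

    def fetch(self, terms):
        """Return the posting lists for the requested sub-query terms."""
        return {t: self.postings[t] for t in terms if t in self.postings}

def nonpipelined_search(nodes, query, k=10):
    """Every involved node ships its lists to one node, which does all the scoring."""
    accumulators = defaultdict(float)
    for node in nodes:                         # at most q nodes are involved
        for term, plist in node.fetch(query).items():
            for doc_id, tf in plist:
                accumulators[doc_id] += tf     # toy scoring
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

nodes = [TermPartitionedNode({"sugar": [(1, 2), (3, 1)]}),
         TermPartitionedNode({"rate": [(2, 4), (3, 1)], "tariff": [(1, 1)]})]
print(nonpipelined_search(nodes, ["sugar", "rate"]))
```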
Pipelined query processing (Moffat et al., 2007)
A query bundle is routed from one node to the next. Each node fetches its posting data, combines it with the previously accumulated results and sends these to the next node. The last node extracts the top results. The number of accumulators is limited by a target value L (Lester et al., 2005).
Main advantages: the work is distributed between the nodes, the network load is reduced (L limits the transfer size), and the overhead on the last node is reduced.
Main problem: long query latency.
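A sketch of the pipelined control flow just described, reusing the TermPartitionedNode class from the previous sketch. The simple cut-off to L accumulators only stands in for the adaptive pruning of Lester et al. (2005); it is not that algorithm:

```python
import heapq

def process_at_node(node, query, accumulators, L):
    """Merge this node's posting lists into the incoming accumulator set."""
    for term, plist in node.fetch(query).items():
        for doc_id, tf in plist:
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf
    if len(accumulators) > L:   # stand-in for adaptive pruning to the target value L
        accumulators = dict(heapq.nlargest(L, accumulators.items(),
                                           key=lambda item: item[1]))
    return accumulators

def pipelined_search(route, query, k=10, L=4):
    """Route the query bundle along `route`; the last node extracts the top-k."""
    accumulators = {}
    for node in route:          # strictly node-at-a-time execution
        accumulators = process_at_node(node, query, accumulators, L)
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

Because each node can only start once the previous node has finished, disk access and processing for a query are strictly sequential, which is the latency problem addressed in the rest of the talk.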
Problem definition and motivation
Term-wise partitioning has many interesting properties and a good potential for improvement.
Pipelined processing gives higher throughput, but longer latency; non-pipelined gives shorter latency, but lower throughput.
We want to design an approach that combines the advantages of both methods: short latency AND high throughput.

Scope and limitations
Disk-based document-ordered inverted index.
Index access model and compression methods are based on the Terrier Search Engine.
Query processing model is based on the approach by Lester et al.
Our observations of pipelined query processing
1. Sequential disk access and data processing.
2. Accumulators have a worse compression ratio than postings.
3. For some queries, pipelined processing might be worse than non-pipelined.
4. The query route may not minimize the network load.
Example term frequencies: f('quota') = 44 395, f('rate') = 1 641 852, f('tariff') = 80 017, f('sugar') = 109 253.
Our approach
Semi-Pipelined Query Processing – addresses sequential disk access and data processing.
Combination Heuristic – addresses that accumulators have a worse compression ratio than postings and that, for some queries, pipelined processing might be worse than non-pipelined.
Alternative Routing Strategy – addresses that the query route may not minimize the network load.
Semi-Pipelined Query Processing
(figures: non-pipelined vs. semi-pipelined execution on nodes 1 and 2)
Each node fetches and decompresses its posting data as soon as the query arrives, in parallel across the nodes, while the accumulator set is still passed from node to node along the query route.
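A sketch of the semi-pipelined variant under the same toy model, again reusing TermPartitionedNode and its fetch method: all nodes on the route fetch (and would decompress) their posting data in parallel as soon as the query arrives, while the accumulator set is still merged in route order. In the real system each node merges its own lists when the accumulator set reaches it; the sequential loop below only stands in for that, and the threading is purely illustrative:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def semipipelined_search(route, query, k=10, L=4):
    """Parallel fetch/decompress on all nodes, pipelined accumulator evaluation."""
    # Non-pipelined part: every node on the route fetches its posting lists concurrently.
    with ThreadPoolExecutor() as pool:
        fetched = list(pool.map(lambda node: node.fetch(query), route))
    # Pipelined part: the accumulator set still travels the route node by node.
    accumulators = {}
    for plists in fetched:
        for term, plist in plists.items():
            for doc_id, tf in plist:
                accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf
        if len(accumulators) > L:   # stand-in for pruning to the accumulator target L
            accumulators = dict(heapq.nlargest(L, accumulators.items(),
                                               key=lambda item: item[1]))
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```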
Combination/Decision Heuristic
For each query, we want to choose between semi-pipelined and non-pipelined processing. The decision depends on an upper-bound estimate of the amount of data to be transferred: we execute a query as non-pipelined when this estimate exceeds a threshold governed by a parameter α (the exact criterion is given as a formula on the original slide).
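The formula itself did not survive the conversion of this slide to text. Purely as an illustration of the kind of test described (comparing an upper bound on the accumulator data a semi-pipelined execution would transfer against the posting data a non-pipelined execution would transfer, scaled by a tuning parameter α like the one varied in the experiments below), a hypothetical decision rule could look as follows; it is not the exact criterion from the paper:

```python
def choose_nonpipelined(list_lengths, L, acc_bytes, post_bytes, alpha):
    """Hypothetical decision rule (not the exact formula from the paper): run the
    query non-pipelined if the estimated accumulator traffic of a semi-pipelined
    execution exceeds alpha times the posting traffic of a non-pipelined execution."""
    # Upper bound on accumulators forwarded after each of the first n-1 route steps:
    # at most the number of postings seen so far, and never more than the target L.
    # Here we assume one query term per node for simplicity.
    seen = 0
    accs_forwarded = 0
    for length in list_lengths[:-1]:     # the last node does not forward accumulators
        seen += length
        accs_forwarded += min(seen, L)
    pipelined_bytes = accs_forwarded * acc_bytes
    nonpipelined_bytes = sum(list_lengths) * post_bytes
    return pipelined_bytes > alpha * nonpipelined_bytes

# Example with the posting list lengths from the routing slide (illustrative values;
# accumulators assumed to cost more bytes than postings, cf. observation 2).
lengths = [186108, 281569, 513121, 10568900]
print(choose_nonpipelined(lengths, L=400000, acc_bytes=8, post_bytes=4, alpha=0.2))
```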
Alternative Routing Strategy
Instead of routing by increasing least term frequency, we route by increasing longest posting list length.
Example (term frequency / posting list length, accumulator set target value L = 400 000):
'quota' 44 395 / 186 108, 'rate' 1 641 852 / 10 568 900, 'tariff' 80 017 / 513 121, 'sugar' 109 253 / 281 569.
Total number of transferred accumulators: 693 244 versus 265 396 (routing by term frequency vs. by posting list length).
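A small sketch of the two routing orders compared on this slide, with hypothetical per-node statistics (one query term per node, numbers taken from the example above); a route is simply the list of involved nodes sorted by the chosen key:

```python
def route_by_least_term_frequency(node_stats):
    """Original strategy: visit nodes in order of increasing smallest term frequency."""
    return sorted(node_stats, key=lambda node: min(f for f, _ in node_stats[node]))

def route_by_longest_posting_list(node_stats):
    """Alternative strategy: visit nodes in order of increasing longest posting list."""
    return sorted(node_stats, key=lambda node: max(l for _, l in node_stats[node]))

# node -> [(term frequency, posting list length), ...] for the query terms it holds.
node_stats = {
    "node_quota":  [(44395, 186108)],
    "node_rate":   [(1641852, 10568900)],
    "node_tariff": [(80017, 513121)],
    "node_sugar":  [(109253, 281569)],
}
print(route_by_least_term_frequency(node_stats))
print(route_by_longest_posting_list(node_stats))
```

For these four terms the two keys give different routes: ordered by term frequency the route is quota, tariff, sugar, rate, while ordered by posting list length it is quota, sugar, tariff, rate.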
Evaluation
A modified, distributed version of the Terrier Search Engine v2.2.1 (http://terrier.org/).
The 426 GB TREC GOV2 corpus – 25 million documents.
20 000 queries from the Terabyte Track 2005 Efficiency Topics (the first 10 000 are used as warm-up).
8 nodes, each with two 2.0 GHz Intel Quad-Core CPUs, 9 GB RAM and a 16 GB SATA HDD; Gigabit network.
Semi-Pipelined Query Processing – results
(Throughput (qps) vs. latency (ms) plots, figures omitted.)
Compared configurations: non-pl, pipelined with and without accumulator compression (plcomp, plnocomp), and semi-pipelined with and without accumulator compression (semi-plcomp, semi-plnocomp).

Combination Heuristic – results
(Throughput vs. latency, figure omitted: comb with α = 0.1, 0.2, 0.3, 0.4, 0.5, compared against non-pl and plcomp.)

Alternative Routing Strategy – results
(Throughput vs. latency, figure omitted: altroute+semi-plcomp compared against non-pl, plcomp and semi-plcomp.)

Combination of the techniques – results
(Throughput vs. latency, figure omitted: altroute+comb α = 0.2 compared against non-pl and plcomp; up to 26% higher throughput than non-pl and up to 32% shorter latency than plcomp.)
Conclusions
We have presented an efficient alternative to the state-of-the-art methods. Our method combines three techniques that minimize latency and maximize throughput. Our results outperform both the pipelined and the non-pipelined method and provide a significant improvement in the overall throughput/latency ratio.
Thank you!