Conference Paper

A Combined Semi-pipelined Query Processing Architecture for Distributed Full-Text Retrieval

Authors: Simon Jonassen and Svein Erik Bratsberg

Abstract

Term-partitioning is an efficient way to distribute a large inverted index. Two fundamentally different query processing approaches are pipelined and non-pipelined execution. While the pipelined approach provides higher query throughput, the non-pipelined approach provides shorter query latency. In this work we propose a third alternative, which combines non-pipelined inverted index access, a heuristic choice between pipelined and non-pipelined query execution, and an improved query routing strategy. Our results show that the method combines the advantages of both approaches and provides both high throughput and short query latency: it increases throughput by up to 26% compared to the non-pipelined approach and reduces latency by up to 32% compared to the pipelined approach.
A Combined Semi-Pipelined Query Processing Architecture for Distributed Full-Text Retrieval
Simon Jonassen and Svein Erik Bratsberg
Department of Computer and Information Science
Norwegian University of Science and Technology
The 11th International Conference on Web Information Systems Engineering (WISE 2010)
Hong Kong, China, 12-14 December 2010
Outline
Introduction to distributed inverted indexes
Problem definition and motivation
Our approach
Experimental evaluation
Conclusions
Inverted index approach to IR
(figure illustrating the inverted index approach; illustration © apple.com)
Document-wise partitioning
Each node indexes a subset of the documents.
A query q is broadcast to all of the nodes and executed concurrently; one of the nodes has to combine the results.
Main advantages: simple and fast.
Main problems: q disk seeks on each node, all of the nodes are involved in processing of each query, and new nodes increase the overhead.
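To make the broadcast-and-merge pattern above concrete, here is a minimal Python sketch; the DocPartitionedNode class, the toy term-frequency scoring and the broker_search function are illustrative assumptions, not part of the original system:

```python
import heapq
from collections import defaultdict

class DocPartitionedNode:
    """Holds a local inverted index over a subset of the documents."""
    def __init__(self, docs):
        # docs: {doc_id: text}; build a toy term -> {doc_id: term frequency} index
        self.index = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def execute(self, query, k):
        """Score the local documents and return the local top-k."""
        scores = defaultdict(float)
        for term in query:
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf   # toy scoring; a real engine would use e.g. BM25
        return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

def broker_search(nodes, query, k=10):
    """Broadcast the query to every node and merge the partial top-k lists."""
    partial = []
    for node in nodes:                 # in a real system these calls run concurrently
        partial.extend(node.execute(query, k))
    return heapq.nlargest(k, partial, key=lambda item: item[1])

nodes = [DocPartitionedNode({1: "sugar tariff quota"}),
         DocPartitionedNode({2: "interest rate", 3: "sugar rate"})]
print(broker_search(nodes, ["sugar", "rate"]))
```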
Term-wise partitioning
Each node stores a subset of a global index.
Each query is divided into a number of sub-queries. Each node fetches its data and sends it to another node, which receives and processes all of the posting lists.
Main advantages: q disk seeks in total, fewer network messages, and up to q nodes involved per query; with n >> q several queries can be executed concurrently, which gives high throughput and fault-tolerance.
Main problems: all processing is done by one node (the other nodes act as advanced network disks), high network load, and load balancing is critical.
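A corresponding sketch of non-pipelined processing over a term-wise partitioned index; the TermPartitionedNode class and the toy scoring are again assumptions made for illustration. Every node holding a query term ships its posting lists to one node, which builds all accumulators and extracts the top results:

```python
import heapq
from collections import defaultdict

class TermPartitionedNode:
    """Holds the full posting lists for a subset of the vocabulary."""
    def __init__(self, postings):
        # postings: {term: [(doc_id, tf), ...]}
        self.postings = postings

    def fetch(self, terms):
        """Return the posting lists for the requested sub-query terms."""
        return {t: self.postings[t] for t in terms if t in self.postings}

def nonpipelined_search(nodes, query, k=10):
    """Every involved node ships its lists to one node, which does all the scoring."""
    accumulators = defaultdict(float)
    for node in nodes:                         # at most q nodes are involved
        for term, plist in node.fetch(query).items():
            for doc_id, tf in plist:
                accumulators[doc_id] += tf     # toy scoring
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

nodes = [TermPartitionedNode({"sugar": [(1, 2), (3, 1)]}),
         TermPartitionedNode({"rate": [(2, 4), (3, 1)], "tariff": [(1, 1)]})]
print(nonpipelined_search(nodes, ["sugar", "rate"]))
```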
Pipelined query processing (Moffat et al., 2007)
A query bundle is routed from one node to the next. Each node fetches its posting data, combines it with the previously accumulated results and sends these to the next node. The last node extracts the top results. The number of accumulators is limited by a target value L (Lester et al., 2005).
Main advantages: the work is distributed between the nodes, the network load is reduced (L limits the transfer size), and the overhead on the last node is reduced.
Main problem: long query latency.
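A sketch of the pipelined control flow just described, reusing the TermPartitionedNode class from the previous sketch. The simple cut-off to L accumulators only stands in for the adaptive pruning of Lester et al. (2005); it is not that algorithm:

```python
import heapq

def process_at_node(node, query, accumulators, L):
    """Merge this node's posting lists into the incoming accumulator set."""
    for term, plist in node.fetch(query).items():
        for doc_id, tf in plist:
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf
    if len(accumulators) > L:   # stand-in for adaptive pruning to the target value L
        accumulators = dict(heapq.nlargest(L, accumulators.items(),
                                           key=lambda item: item[1]))
    return accumulators

def pipelined_search(route, query, k=10, L=4):
    """Route the query bundle along `route`; the last node extracts the top-k."""
    accumulators = {}
    for node in route:          # strictly node-at-a-time execution
        accumulators = process_at_node(node, query, accumulators, L)
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

Because each node can only start once the previous node has finished, disk access and processing for a query are strictly sequential, which is the latency problem addressed in the rest of the talk.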
Problem definition and motivation
Term-wise partitioning has many interesting properties and a good potential for improvement.
Pipelined processing gives higher throughput, but longer latency; non-pipelined gives shorter latency, but lower throughput.
We want to design an approach that combines the advantages of both methods: short latency AND high throughput.

Scope and limitations
Disk-based document-ordered inverted index.
Index access model and compression methods are based on the Terrier Search Engine.
Query processing model is based on the approach by Lester et al.
Our observations of pipelined query processing
1. Sequential disk access and data processing.
2. Accumulators have a worse compression ratio than postings.
3. For some queries, pipelined processing might be worse than non-pipelined.
4. The query route may not minimize the network load.
Example term frequencies: f('quota') = 44 395, f('rate') = 1 641 852, f('tariff') = 80 017, f('sugar') = 109 253.
Our approach
Semi-Pipelined Query Processing – addresses sequential disk access and data processing.
Combination Heuristic – addresses that accumulators have a worse compression ratio than postings and that, for some queries, pipelined processing might be worse than non-pipelined.
Alternative Routing Strategy – addresses that the query route may not minimize the network load.
Semi-Pipelined Query Processing
(figures: non-pipelined vs. semi-pipelined execution on nodes 1 and 2)
Each node fetches and decompresses its posting data as soon as the query arrives, in parallel across the nodes, while the accumulator set is still passed from node to node along the query route.
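A sketch of the semi-pipelined variant under the same toy model, again reusing TermPartitionedNode and its fetch method: all nodes on the route fetch (and would decompress) their posting data in parallel as soon as the query arrives, while the accumulator set is still merged in route order. In the real system each node merges its own lists when the accumulator set reaches it; the sequential loop below only stands in for that, and the threading is purely illustrative:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def semipipelined_search(route, query, k=10, L=4):
    """Parallel fetch/decompress on all nodes, pipelined accumulator evaluation."""
    # Non-pipelined part: every node on the route fetches its posting lists concurrently.
    with ThreadPoolExecutor() as pool:
        fetched = list(pool.map(lambda node: node.fetch(query), route))
    # Pipelined part: the accumulator set still travels the route node by node.
    accumulators = {}
    for plists in fetched:
        for term, plist in plists.items():
            for doc_id, tf in plist:
                accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf
        if len(accumulators) > L:   # stand-in for pruning to the accumulator target L
            accumulators = dict(heapq.nlargest(L, accumulators.items(),
                                               key=lambda item: item[1]))
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```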
Combination/Decision Heuristic
For each query, we want to choose between semi-pipelined and non-pipelined processing. The decision depends on an upper-bound estimate of the amount of data to be transferred: we execute a query as non-pipelined when this estimate exceeds a threshold governed by a parameter α (the exact criterion is given as a formula on the original slide).
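The formula itself did not survive the conversion of this slide to text. Purely as an illustration of the kind of test described (comparing an upper bound on the accumulator data a semi-pipelined execution would transfer against the posting data a non-pipelined execution would transfer, scaled by a tuning parameter α like the one varied in the experiments below), a hypothetical decision rule could look as follows; it is not the exact criterion from the paper:

```python
def choose_nonpipelined(list_lengths, L, acc_bytes, post_bytes, alpha):
    """Hypothetical decision rule (not the exact formula from the paper): run the
    query non-pipelined if the estimated accumulator traffic of a semi-pipelined
    execution exceeds alpha times the posting traffic of a non-pipelined execution."""
    # Upper bound on accumulators forwarded after each of the first n-1 route steps:
    # at most the number of postings seen so far, and never more than the target L.
    # Here we assume one query term per node for simplicity.
    seen = 0
    accs_forwarded = 0
    for length in list_lengths[:-1]:     # the last node does not forward accumulators
        seen += length
        accs_forwarded += min(seen, L)
    pipelined_bytes = accs_forwarded * acc_bytes
    nonpipelined_bytes = sum(list_lengths) * post_bytes
    return pipelined_bytes > alpha * nonpipelined_bytes

# Example with the posting list lengths from the routing slide (illustrative values;
# accumulators assumed to cost more bytes than postings, cf. observation 2).
lengths = [186108, 281569, 513121, 10568900]
print(choose_nonpipelined(lengths, L=400000, acc_bytes=8, post_bytes=4, alpha=0.2))
```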
Alternative Routing Strategy
Instead of routing by increasing least term frequency, we route by increasing longest posting list length.
Example (term frequency / posting list length, accumulator set target value L = 400 000):
'quota' 44 395 / 186 108, 'rate' 1 641 852 / 10 568 900, 'tariff' 80 017 / 513 121, 'sugar' 109 253 / 281 569.
Total number of transferred accumulators: 693 244 versus 265 396 (routing by term frequency vs. by posting list length).
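A small sketch of the two routing orders compared on this slide, with hypothetical per-node statistics (one query term per node, numbers taken from the example above); a route is simply the list of involved nodes sorted by the chosen key:

```python
def route_by_least_term_frequency(node_stats):
    """Original strategy: visit nodes in order of increasing smallest term frequency."""
    return sorted(node_stats, key=lambda node: min(f for f, _ in node_stats[node]))

def route_by_longest_posting_list(node_stats):
    """Alternative strategy: visit nodes in order of increasing longest posting list."""
    return sorted(node_stats, key=lambda node: max(l for _, l in node_stats[node]))

# node -> [(term frequency, posting list length), ...] for the query terms it holds.
node_stats = {
    "node_quota":  [(44395, 186108)],
    "node_rate":   [(1641852, 10568900)],
    "node_tariff": [(80017, 513121)],
    "node_sugar":  [(109253, 281569)],
}
print(route_by_least_term_frequency(node_stats))
print(route_by_longest_posting_list(node_stats))
```

For these four terms the two keys give different routes: ordered by term frequency the route is quota, tariff, sugar, rate, while ordered by posting list length it is quota, sugar, tariff, rate.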
Evaluation
A modified, distributed version of the Terrier Search Engine v2.2.1 (http://terrier.org/).
The 426 GB TREC GOV2 corpus – 25 million documents.
20 000 queries from the Terabyte Track 2005 Efficiency Topics (the first 10 000 are used as warm-up).
8 nodes, each with two 2.0 GHz Intel Quad-Core CPUs, 9 GB RAM and a 16 GB SATA HDD; Gigabit network.
Semi-Pipelined Query Processing – results
(Throughput (qps) vs. latency (ms) plots, figures omitted.)
Compared configurations: non-pl, pipelined with and without accumulator compression (plcomp, plnocomp), and semi-pipelined with and without accumulator compression (semi-plcomp, semi-plnocomp).

Combination Heuristic – results
(Throughput vs. latency, figure omitted: comb with α = 0.1, 0.2, 0.3, 0.4, 0.5, compared against non-pl and plcomp.)

Alternative Routing Strategy – results
(Throughput vs. latency, figure omitted: altroute+semi-plcomp compared against non-pl, plcomp and semi-plcomp.)

Combination of the techniques – results
(Throughput vs. latency, figure omitted: altroute+comb α = 0.2 compared against non-pl and plcomp; up to 26% higher throughput than non-pl and up to 32% shorter latency than plcomp.)
Conclusions
We have presented an efficient alternative to the state-of-the-art methods. Our method combines three techniques that minimize latency and maximize throughput. Our results outperform both the pipelined and the non-pipelined method and provide a significant improvement in the overall throughput/latency ratio.
Thank you!