Content uploaded by Svein Erik Bratsberg on Aug 07, 2014.
simonj@idi.ntnu.no www.iad-centre.no
A Combined Semi-Pipelined Query Processing
Architecture for Distributed Full-Text Retrieval
Simon Jonassen and Svein Erik Bratsberg
Department of Computer and Information Science
Norwegian University of Science and Technology
The 11th International Conference on Web Information System Engineering
Hong Kong, China
12-14 December, 2010
Outline
• Introduction to distributed inverted indexes
• Problem definition and motivation
• Our approach
• Experimental evaluation
• Conclusions
Inverted index approach to IR
[Figure: illustration of queries answered through an inverted index, © apple.com]
Document-wise partitioning
• Each node indexes a subset of documents
• A query q is broadcast to all of the nodes and executed concurrently.
• One of the nodes combines the results.
• Main advantages:
  – Simple and fast.
• Main problems:
  – q disk seeks on each node.
  – All of the nodes are involved in processing each query, so new nodes increase the overhead.
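The broadcast-and-merge flow above can be sketched as follows; the per-node scoring and the final top-k merge are toy stand-ins for a real engine's ranking, and all names are made up:

```python
import heapq

def score_on_node(node_docs, query, k):
    """Score the query against this node's document subset; return the local top-k."""
    scored = []
    for doc_id, text in node_docs.items():
        score = sum(text.split().count(term) for term in query)  # toy tf scoring
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def broadcast_query(nodes, query, k):
    """Broadcast q to every node (concurrently in practice), merge on one node."""
    partial = [score_on_node(node, query, k) for node in nodes]
    return heapq.nlargest(k, (hit for part in partial for hit in part))

nodes = [
    {1: "tariff quota sugar", 2: "rate rate"},   # node 0's documents
    {3: "sugar rate", 4: "quota"},               # node 1's documents
]
print(broadcast_query(nodes, ["sugar", "rate"], k=2))
```

Note that every node touches every query, which is exactly the scaling problem named on the slide.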
Term-wise partitioning
• Each node stores a subset of a global index
• Each query is divided into a number of sub-queries.
• Each node fetches the data for its sub-query and sends it to a single node, which receives and processes all of the posting lists.
• Main advantages:
  – q disk seeks in total and fewer network messages.
  – Up to q nodes are involved per query, so with n >> q several queries can be executed concurrently.
  – High throughput and fault-tolerance.
• Main problems:
  – All processing is done by one node; the other nodes act as advanced network disks.
  – High network load.
  – Load balancing is critical.
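The sub-query mechanics can be sketched as follows, assuming a hypothetical deterministic term-to-node assignment and AND semantics on the receiving node (the node that, as noted above, ends up doing all the work):

```python
import zlib
from collections import defaultdict

def owner(term, n_nodes):
    """Hypothetical global assignment: each term's posting list lives on one node."""
    return zlib.crc32(term.encode()) % n_nodes

def make_subqueries(query, n_nodes):
    """Split the query so each node is asked only for the terms it stores."""
    sub = defaultdict(list)
    for term in query:
        sub[owner(term, n_nodes)].append(term)
    return dict(sub)

def process_on_receiver(posting_lists):
    """The receiving node intersects every fetched posting list (AND semantics)."""
    doc_sets = [set(pl) for pl in posting_lists.values()]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Toy global index; in a real system each list is stored on owner(term, n).
index = {"sugar": [1, 3, 7], "rate": [2, 3, 7, 9], "quota": [3, 4]}
subs = make_subqueries(list(index), n_nodes=2)
fetched = {t: index[t] for terms in subs.values() for t in terms}
print(process_on_receiver(fetched))  # prints [3]
```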
Pipelined query processing (Moffat et al., 2007)
• A query bundle is routed from one node to the next.
• Each node fetches its posting data, combines it with the previously accumulated results and sends these to the next node.
• The last node extracts the top results.
• The number of accumulators is limited by a target value L (Lester et al., 2005).
• Main advantages:
  – Work is distributed between the nodes.
  – Reduced network load: L limits the transfer size.
  – Reduced overhead on the last node.
• Main problem:
  – Long query latency.
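Each hop of the pipeline can be sketched as below; the hard cutoff at L accumulators is a crude stand-in for the adaptive pruning of Lester et al. (2005), and the scores are toy values:

```python
def pipeline_step(accumulators, postings, L):
    """Merge one node's postings into the received accumulator set, keep |A| <= L.

    accumulators: {doc_id: partial_score} received from the previous node.
    postings:     {doc_id: contribution} for the query terms stored on this node.
    """
    for doc_id, contrib in postings.items():
        accumulators[doc_id] = accumulators.get(doc_id, 0.0) + contrib
    if len(accumulators) > L:
        # Keep only the L best accumulators before forwarding them.
        best = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)[:L]
        accumulators = dict(best)
    return accumulators  # compressed and sent to the next node in the route

# A query bundle routed over three nodes; the last node extracts the top-2.
acc = {}
for node_postings in [{1: 2.0, 2: 1.0}, {2: 3.0, 3: 0.5}, {1: 1.0, 4: 0.2}]:
    acc = pipeline_step(acc, node_postings, L=3)
top = sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```

Because each node must wait for the previous node's accumulators, the hops serialize, which is the latency problem named above.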
Outline
• Introduction to distributed inverted indexes
• Problem definition and motivation
• Our approach
• Experimental evaluation
• Conclusions
Problem definition and motivation
• Term-wise partitioning – many interesting properties
and a good potential for improvement.
• Pipelined – higher throughput, but longer latency.
• Non-pipelined – shorter latency, but lower throughput.
• We want to design an approach that combines the
advantages of both methods – short latency AND high
throughput.
Scope and limitations
• Disk-based document-ordered inverted index.
• Index access model and compression methods are
based on the Terrier Search Engine.
• Query processing model is based on the approach by
Lester et al.
Outline
• Introduction to distributed inverted indexes
• Problem definition and motivation
• Our approach
• Experimental evaluation
• Conclusions
Our observations of pipelined query processing
1. Sequential disk access and data processing.
2. Accumulators have a worse compression ratio than postings.
3. For some queries, pipelined processing might be worse than non-pipelined.
4. The query route may not minimize the network load.
Example term frequencies: f('quota') = 44 395, f('rate') = 1 641 852, f('tariff') = 80 017, f('sugar') = 109 253.
Our approach
• Semi-Pipelined Query Processing
Sequential disk-access and data processing.
• Combination Heuristic
Accumulators have a worse compression ratio than postings.
For some queries, pipelined processing might be worse than
non-pipelined.
• Alternative Routing Strategy
Query route may not minimize the network load.
Semi-Pipelined Query Processing
[Figure: semi-pipelined execution on two nodes; disk access runs concurrently on both nodes, while accumulator processing is pipelined from node 1 to node 2]
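One reading of the figure, sketched with threads: every node starts its disk access as soon as the query arrives (the non-pipelined part), while the accumulator set still travels from node to node in route order (the pipelined part). The split shown here is an interpretation of the slide, and all postings are toy data:

```python
import threading
import time

def semi_pipelined(postings_per_node, fetch_time=0.01):
    """Fetch on all nodes concurrently; process accumulators in pipeline order."""
    n = len(postings_per_node)
    fetched, done = {}, [threading.Event() for _ in range(n)]

    def fetch(i):
        time.sleep(fetch_time)            # simulated disk access, overlapped across nodes
        fetched[i] = postings_per_node[i]
        done[i].set()

    for i in range(n):                    # disk-access phase starts everywhere at once
        threading.Thread(target=fetch, args=(i,)).start()

    acc = {}
    for i in range(n):                    # accumulator pipeline: node 0 -> ... -> n-1
        done[i].wait()                    # node i's postings are already in memory
        for doc, score in fetched[i].items():
            acc[doc] = acc.get(doc, 0.0) + score
    return acc

print(semi_pipelined([{1: 2.0}, {1: 1.0, 2: 3.0}, {3: 0.5}]))
```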
Combination/Decision Heuristic
• For each query, we want to choose between semi- and
non-pipelined processing.
• Our decision depends on the upper bound estimate for
the amount of data to be transferred.
• We execute a query as non-pipelined when an inequality on this estimate, parameterized by a threshold α, is satisfied.
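Purely as an illustration, a decision rule of this shape can be sketched as follows; both the upper-bound estimate and the threshold form used here are assumptions, not the paper's actual inequality:

```python
def choose_non_pipelined(term_freqs, L, alpha):
    """Illustrative decision rule (assumed form, not the paper's exact inequality).

    Upper-bound the accumulators a pipelined run would forward (running sum of
    term frequencies, capped at L, for every node except the last), and run the
    query non-pipelined when that bound is close to the raw posting volume.
    """
    postings_total = sum(term_freqs)
    running = transfer_bound = 0
    for f in sorted(term_freqs)[:-1]:
        running += f
        transfer_bound += min(running, L)
    return transfer_bound > (1 - alpha) * postings_total

print(choose_non_pipelined([44395, 80017, 109253], L=400000, alpha=0.2))
print(choose_non_pipelined([44395, 80017, 109253], L=400000, alpha=0.5))
```

A larger α makes the rule fall back to non-pipelined execution more eagerly.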
Alternative Routing Strategy
• Instead of routing by increasing least term frequency, we route by increasing longest posting list length.
• Example (term frequency / posting list length, with accumulator set target value L = 400 000):
  – 'quota' 44 395 / 186 108
  – 'rate' 1 641 852 / 10 568 900
  – 'sugar' 109 253 / 281 569
  – 'tariff' 80 017 / 513 121
• Total transferred accumulators: 693 244 with the frequency-based route vs. 265 396 with the alternative route.
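Using the per-term statistics from the slide, the two route orderings can be reproduced as follows; the one-term-per-node layout is a simplification, and the sketch only orders the nodes, it does not model the resulting accumulator traffic:

```python
# Term statistics from the slide: term frequency and posting list length.
terms = {
    "quota":  {"freq": 44395,   "plist_len": 186108},
    "rate":   {"freq": 1641852, "plist_len": 10568900},
    "sugar":  {"freq": 109253,  "plist_len": 281569},
    "tariff": {"freq": 80017,   "plist_len": 513121},
}

def route(terms, key):
    """Order the nodes (one term per node here) by increasing value of `key`."""
    return sorted(terms, key=lambda t: terms[t][key])

print(route(terms, "freq"))       # classic route: by least term frequency
print(route(terms, "plist_len"))  # alternative route: by longest posting list length
```

Note how 'sugar' and 'tariff' swap places between the two routes: 'tariff' is rarer but its posting list is longer.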
Outline
• Introduction to distributed inverted indexes
• Problem definition and motivation
• Our approach
• Experimental evaluation
• Conclusions
Evaluation
• A modified, distributed version of the Terrier Search
Engine v2.2.1 (http://terrier.org/)
• The 426GB TREC GOV2 corpus – 25 million documents
• 20 000 queries from the Terabyte Track 05 Efficiency
Topics (the first 10 000 are used as warm-up)
• 8 nodes
– Two 2.0GHz Intel Quad-Core, 9GB RAM, 16GB SATA HDD on
each node. Gigabit network.
Semi-Pipelined Query Processing
[Figure: throughput (qps) vs. latency (ms) for non-pl, plnocomp, plcomp, semi-plnocomp and semi-plcomp]
Combination Heuristic
[Figure: throughput (qps) vs. latency (ms) for non-pl, plcomp and comb with α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}]
Alternative Routing Strategy
[Figure: throughput (qps) vs. latency (ms) for non-pl, plcomp, semi-plcomp and altroute+semi-plcomp]
Combination of the techniques
[Figure: throughput (qps) vs. latency (ms) for non-pl, plcomp and altroute+comb α = 0.2, with annotated improvements of 26% and 32%]
Outline
• Introduction to distributed inverted indexes
• Problem definition and motivation
• Our approach
• Experimental evaluation
• Conclusions
Conclusions
• We have presented
– an efficient alternative to the state-of-the-art methods.
• Our method
– combines three techniques that minimize latency and maximize
throughput.
• Our results
– outperform both methods and provide a significant improvement
in the overall throughput/latency ratio.
Thank you!