Papers on Parallel IR (03/20/2003)


Dec 14, 2015


Marianna End

03/20/2003 Parallel IR 1

Papers on Parallel IR

Agenda
• Introduction
• Paper 1: Inverted file partitioning schemes in multiple disk systems
• Paper 2: Parallel search using partitioned inverted files
• Comparison
• Conclusion
• URL Links to Paper


Parallel IR Introduction

Parallelism in query processing involves:

1. Multitasking simultaneous queries
– A thread or process for each user query, which can execute on a CPU
– The same thread or process completes an entire single query
– Ability to handle multiple concurrent queries

2. Query partitioning
– A single query is broken into sub tasks
– Each sub task can run in parallel
– Improves the response time of a single query


Partitioning Query into Sub Tasks

• IR involves dealing with large amounts of data, so we can partition the data set between sub tasks:
– Document Partitioning
  • Divides the documents over the sub tasks, so that each sub task processes a subset of the documents
– Term Partitioning
  • Divides the indexing terms among the sub tasks, so that the processing of each document is spread across the sub tasks
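The two ways of dividing the data can be sketched on a toy inverted index. This is only an illustration: the sample postings, the modulo placement rule, and the function names are assumptions made for the example and do not come from either paper.

```python
from collections import defaultdict

# Toy inverted index (hypothetical data): term -> list of (doc_id, weight).
INDEX = {
    "parallel": [(1, 0.5), (2, 0.3), (4, 0.9)],
    "disk": [(2, 0.7), (3, 0.2)],
    "query": [(1, 0.4), (3, 0.6), (4, 0.1)],
}


def partition_by_document(index, n_subtasks):
    """Document partitioning: each sub task sees a subset of the documents."""
    parts = [defaultdict(list) for _ in range(n_subtasks)]
    for term, postings in index.items():
        for doc_id, weight in postings:
            # Assumed placement rule: document id modulo number of sub tasks.
            parts[doc_id % n_subtasks][term].append((doc_id, weight))
    return [dict(part) for part in parts]


def partition_by_term(index, n_subtasks):
    """Term partitioning: each sub task sees whole posting lists of some terms."""
    parts = [{} for _ in range(n_subtasks)]
    for position, term in enumerate(sorted(index)):
        # Assumed placement rule: round-robin over the sorted terms.
        parts[position % n_subtasks][term] = list(index[term])
    return parts


doc_parts = partition_by_document(INDEX, 2)
term_parts = partition_by_term(INDEX, 2)
print(doc_parts)
print(term_parts)
```

With document partitioning every sub task holds a slice of each term's postings; with term partitioning a term's postings live in exactly one sub task.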


Theme of the Papers Being Presented

• Both papers explore the issues and performance implications for parallel IR systems that use inverted indexes, when they employ:

– A) Document Partitioning

– B) Index Term Partitioning

• Paper1: Inverted file partitioning schemes in multiple disk systems

• Paper2: Parallel search using partitioned inverted files


P1: Inverted File Systems

• An inverted file system consists of:
– Index File: an ordered list of all keywords that have been used to index a collection of documents. Along with each term there are fields that give the location and number of postings in the posting file
– Posting File: a group of records, each holding the weight of the term and a pointer into the actual document file
– Document File: contains the actual document records of the collection
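The three files can be modelled in a few lines. The sketch below is a minimal in-memory stand-in, assuming a flat posting list addressed by offset and count; the sample documents, weights, and the `lookup` helper are hypothetical and only illustrate how the index file leads to the postings and then to the documents.

```python
import bisect

# Document file: the actual document records (hypothetical sample data).
DOCUMENT_FILE = {
    10: "parallel search over partitioned inverted files",
    11: "disk scheduling for multiple disk systems",
    12: "parallel disk access",
}

# Posting file: records of (term weight, pointer to a document record).
POSTING_FILE = [
    (0.8, 11), (0.6, 12),  # postings of "disk", offsets 0-1
    (0.9, 10), (0.4, 12),  # postings of "parallel", offsets 2-3
    (0.7, 10),             # postings of "search", offset 4
]

# Index file: ordered keywords with the location and number of their postings.
INDEX_FILE = [("disk", 0, 2), ("parallel", 2, 2), ("search", 4, 1)]


def lookup(term):
    """Return (weight, document text) pairs for a term, or [] if unknown."""
    keywords = [keyword for keyword, _, _ in INDEX_FILE]
    position = bisect.bisect_left(keywords, term)  # binary search, list is ordered
    if position == len(keywords) or keywords[position] != term:
        return []
    _, offset, count = INDEX_FILE[position]
    records = POSTING_FILE[offset:offset + count]
    return [(weight, DOCUMENT_FILE[doc_id]) for weight, doc_id in records]


print(lookup("parallel"))
```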


P1: Inverted File Systems ( cont )


P1: Load Balancing

In a multiple-CPU, multiple-disk system we need to:
• Balance the load on the processors
– Need to maximize CPU utilization
• Balance the load on the I/O devices, i.e. the disk drives
– Avoid I/O bottlenecks, which cause CPUs to go into wait states


P1: Partitioning an Inverted File

The paper explores two schemes:
– Based on Term Id
– Based on Document Id

• With both schemes, the index file and the document file are partitioned the same way: the index file by index term id and the document file by document id
• As we have seen, the posting file carries both the document id and the index term id. One scheme partitions the posting file by term id, while the other partitions it by document id.


P1:Partitioning an Inverted File ( cont)


P1: Objective of Partitioning Inverted Index

• Objective: to maximize performance
• Ideal: all I/O channels and disk drives are equally used when the sub tasks of a query are executed in parallel
• However, data usage changes from query to query and cannot be predicted, so the ideal cannot be achieved
• The paper recognizes that I/O is a major cost factor in IR


A Brief Comparison

Document Id | Term Id
All posting entries of a document are on the same disk | All posting entries of an index term are on the same disk
The index file needs to store disk information with each index term to indicate where its posting entries are stored, so it requires more space | Not needed, as all posting entries of an index term are on the same disk; less space is used
Disk space usage over the multiple disks is balanced | Since the posting size of an index term varies with its frequency of occurrence in the collection, disk space usage may be unbalanced


A Brief Comparison (cont.)

• The main difference is in the I/O characteristics:

For a single query index term, the sub task's disk I/O is distributed across multiple disks under DocumentId partitioning, while under TermId partitioning it is limited to one disk.

Which is better? It is a tradeoff.
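The difference in I/O pattern can be made concrete by listing which disks a single-term query touches under each scheme. This is a rough sketch: the four disks, the posting lists, the term ids, and the modulo placement are all assumptions chosen for the illustration.

```python
N_DISKS = 4

# Hypothetical data: term -> ids of the documents that contain it.
POSTINGS = {"parallel": [1, 2, 4, 7, 9], "disk": [2, 3]}
# Hypothetical numeric term ids used for TermId placement.
TERM_IDS = {"parallel": 0, "disk": 1}


def disks_touched_docid(term):
    """DocumentId partitioning: a posting lives on the disk of its document."""
    return sorted({doc_id % N_DISKS for doc_id in POSTINGS[term]})


def disks_touched_termid(term):
    """TermId partitioning: all postings of a term live on one disk."""
    return [TERM_IDS[term] % N_DISKS]


for term in POSTINGS:
    print(term, disks_touched_docid(term), disks_touched_termid(term))
```

Under these assumptions a frequent term fans out over every disk with DocumentId partitioning, which parallelizes the read but costs one request per disk, while TermId partitioning issues a single request to a single disk.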


P1: Simulation Model

• To compare the two schemes, the paper defines a simulation model with the following factors:

a) Collection Database Model: follows the natural-language text distribution described by Zipf's law, under which 20% of the index terms account for 80% of the posting entries. The model skews these ratios to observe the effect on query performance.

b) User Query Model: the paper uses two cases: a skewed query model, in which some low-rank terms are requested frequently, and a uniform query model, in which all terms have the same probability.


P1: Simulation Model (cont.)

c) Queuing Model: Concurrent I/O requests on the same device are queued in priority. CPU usage requests on the same CPU are also queued

d) Work Load Model: Vary the number of disks and CPUs
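The 80/20 figure in the collection model can be checked numerically. The sketch below assumes that the posting count of the term at rank r is proportional to 1 / r^skew; the `skew` exponent and the vocabulary size of 1000 terms are assumptions for illustration, not parameters taken from the paper.

```python
def top_share(n_terms, top_fraction=0.2, skew=1.0):
    """Fraction of all postings held by the most frequent `top_fraction` of terms.

    Assumes posting counts proportional to 1 / rank**skew (a Zipf-like law).
    """
    weights = [1.0 / (rank ** skew) for rank in range(1, n_terms + 1)]
    n_top = round(n_terms * top_fraction)
    return sum(weights[:n_top]) / sum(weights)


# skew=0 is a uniform collection; larger values concentrate the postings
# in fewer terms.
for skew in (0.0, 0.5, 1.0, 1.5):
    print(skew, round(top_share(1000, skew=skew), 3))
```

With 1000 terms and an exponent of 1, the top 20% of terms hold a little under 80% of the postings, which is close to the ratio quoted on the slide.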


Simulation Results

• Increasing the number of disks, up to a threshold, improves performance by decreasing the response time
• When the index term and query term distributions are not skewed, the partitioning scheme based on term id performs best
• When the data is skewed, the partitioning scheme based on document id performs best. With skewed data (80/20) and TermId partitioning, the disks holding those 20% of terms become bottlenecks
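The bottleneck effect under skew can be illustrated with a small expected-load calculation. This is not the paper's simulation: it ignores seek times, queuing, and CPU cost, and it assumes that both posting list size and query probability fall off as 1 / rank^skew and that terms are placed on disks round-robin. All names and parameters are hypothetical.

```python
def disk_loads(n_terms, n_disks, skew):
    """Expected postings read per disk for one single-term query.

    Returns (TermId loads, DocumentId loads) under the stated assumptions.
    """
    weights = [1.0 / (rank ** skew) for rank in range(1, n_terms + 1)]
    total_weight = sum(weights)
    # Expected postings read for each term = query probability * list size.
    expected = [(w / total_weight) * (1000.0 * w) for w in weights]

    # TermId: the whole posting list of a term sits on one disk.
    term_loads = [0.0] * n_disks
    for term_id, load in enumerate(expected):
        term_loads[term_id % n_disks] += load

    # DocumentId: every posting list is spread evenly over all disks.
    doc_loads = [sum(expected) / n_disks] * n_disks
    return term_loads, doc_loads


def imbalance(loads):
    """Ratio of the busiest disk to the average disk (1.0 means balanced)."""
    return max(loads) / (sum(loads) / len(loads))


for skew in (0.0, 1.0):
    term_loads, doc_loads = disk_loads(1000, 4, skew)
    print(skew, round(imbalance(term_loads), 2), round(imbalance(doc_loads), 2))
```

In this simplified model the uniform case is balanced under both schemes, while the skewed case leaves the disk holding the most frequent term far busier than the average under TermId placement. The sketch says nothing about why TermId wins in the unskewed case, since that depends on per-request costs it does not model.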


Paper 2 - Positioning w.r.t. Paper 1

• The thrust of Paper 1's approach is to partition user queries by index term, with each index term query becoming a sub task. The objective then becomes optimizing the individual sub task with the biggest bottleneck, which is I/O

• What if the user query has only one query index term? Your disks are optimized, but your CPUs are idle

• Paper 2 recognizes that most user queries are single term only. Why?


P2: Search Topology Framework

• P2 proposes a different framework:


P2: Search Topology ( Cont..)

• Top Node: Accepts query from client and distributes it to all of its child nodes and awaits results.

• Leaf Node: Looks after only ONE PARTITION of the inverted file. Each leaf node and the top node have a processor each.

Within this framework, the paper's objective is to evaluate which type of inverted index partitioning is better: DocId based or TermId based.
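The top node / leaf node arrangement is a scatter-gather pattern, which can be sketched as follows. Threads stand in for the per-node processors, the partitions use DocId partitioning, and the data, weights, and function names are hypothetical; this is not the PLIERS implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# One partition of the inverted file per leaf node: term -> {doc_id: weight}.
LEAF_PARTITIONS = [
    {"parallel": {1: 0.5}, "disk": {3: 0.25}},
    {"parallel": {2: 0.25, 4: 0.75}, "disk": {2: 0.75}},
]


def leaf_search(partition, terms):
    """Leaf node: score the documents of its own partition only."""
    scores = {}
    for term in terms:
        for doc_id, weight in partition.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores


def top_node_search(terms, k=3):
    """Top node: send the query to every leaf, wait, and merge the results."""
    with ThreadPoolExecutor(max_workers=len(LEAF_PARTITIONS)) as pool:
        partials = list(pool.map(lambda part: leaf_search(part, terms),
                                 LEAF_PARTITIONS))
    merged = {}
    for partial in partials:
        for doc_id, score in partial.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + score
    # Highest score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:k]


print(top_node_search(["parallel", "disk"]))
```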


P2: Approach

• The paper uses real web collections instead of simulations for its experiments

• The PLIERS system is used on an AP3000 machine with 8 to 12 nodes

• The data comprised BASE1 (1 GB) to BASE10 (10 GB) of the VLC2 collection

• Queries were based on topics 351 to 400 of the TREC-7 ad hoc track

• Title-only and whole-topic queries were used

• Both DocId and TermId index partitioning were used

• Bottom line: real data instead of simulation


P2: Summary of Results

Within the framework of the experiment:

• DocId partitioning is better than TermId partitioning in a multiprocessor environment

• The TermId approach imposes too much communication overhead between the leaf nodes and the top node, because the final result for a given document depends on the results from every leaf node
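One way to see the communication overhead is to count the score entries each leaf must send to the top node. In the sketch below, a TermId leaf can only produce partial scores, so it sends one entry per document in its term's posting list, whereas a DocId leaf holds complete scores for its documents and can send just its local top k. The data, the one-leaf-per-term layout, and the message format are assumptions for illustration, not the protocol used in the paper.

```python
# Hypothetical postings: term -> {doc_id: weight}.
POSTINGS = {
    "parallel": {1: 0.5, 2: 0.25, 4: 0.75},
    "disk": {2: 0.75, 3: 0.25, 4: 0.125},
}


def termid_messages(terms):
    """TermId: one leaf per term; each must send every partial score."""
    return [dict(POSTINGS[term]) for term in terms]


def docid_messages(terms, n_leaves, k):
    """DocId: each leaf scores its own documents fully and sends its top k."""
    messages = []
    for leaf in range(n_leaves):
        scores = {}
        for term in terms:
            for doc_id, weight in POSTINGS[term].items():
                if doc_id % n_leaves == leaf:  # assumed modulo placement
                    scores[doc_id] = scores.get(doc_id, 0.0) + weight
        best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
        messages.append(dict(best))
    return messages


def merge_top(messages, k):
    """Top node: sum the scores per document and keep the k best."""
    merged = {}
    for message in messages:
        for doc_id, score in message.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + score
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:k]


def entries_sent(messages):
    return sum(len(message) for message in messages)


query = ["parallel", "disk"]
term_msgs = termid_messages(query)
doc_msgs = docid_messages(query, n_leaves=2, k=1)
print(entries_sent(term_msgs), merge_top(term_msgs, 1))
print(entries_sent(doc_msgs), merge_top(doc_msgs, 1))
```

Both layouts agree on the best document here, but the TermId leaves ship three times as many entries in this toy case, and the gap grows with the length of the posting lists.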


Comparison

Paper 1 | Paper 2
Breaks queries into sub tasks based on query keywords | Breaks a query into sub tasks based on the number of partitions of the inverted index
Focuses on optimizing disk I/O access | Focuses on optimizing processor use
Assumes a more generic topological framework | Assumes a very specific framework: the total number of CPUs needed depends on data-driven partitions
Draws conclusions on the pluses and minuses of the DocId and TermId partitioning schemes based on properties of the document collection | Because of its specific framework assumptions, concludes that DocId partitioning of the inverted index is best within that framework


Conclusion

In combination, these two papers highlight the issues of processor and I/O utilization, in the context of the factors that affect partitioning inverted indexes under the DocumentId and TermId schemes.


URL Links to Paper

Paper 1: Inverted file partitioning schemes in multiple disk systems
Byeong-Soo Jeong; Omiecinski, E.; Parallel and Distributed Systems, IEEE Transactions on, Volume: 6, Issue: 2, Feb 1995
http://ieeexplore.ieee.org/iel4/71/8001/00342125.pdf?isNumber=8001&prod=IEEE+JNL&arnumber=342125&arSt=142&ared=153&arAuthor=Byeong-Soo+Jeong%3B+Omiecinski%2C+E.%3B

Paper 2: Parallel search using partitioned inverted files
MacFarlane, A.; McCann, J.A.; Robertson, S.E.; String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on, 2000
http://ieeexplore.ieee.org/iel5/7055/19010/00878197.pdf?isNumber=19010&prod=IEEE+CNF&arnumber=878197&arSt=209&ared=220&arAuthor=MacFarlane%2C+A.%3B+McCann%2C+J.A.%3B+Robertson%2C+S.E.%3B