Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented By Guan Guan

Dec 13, 2015

Heather Bridges
Transcript
Page 1

Building a Distributed Full-Text Index for the Web

by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University

Presented By

Guan Guan

Page 2

Overview

• INTRODUCTION

• TESTBED ARCHITECTURE

• PIPELINED INDEXER DESIGN

• MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM

• COLLECTING GLOBAL STATISTICS

• CONCLUSIONS

Page 3

Inverted Index

Book index vs. inverted index: the two are similar.

Page 4

Steps to build an inverted index

• Processing each page to extract postings

• Sorting the postings first on index terms and then on locations

• Writing out the sorted postings as a collection of inverted lists on disk

Index build time becomes critical for two reasons:

• Web scale and growth rate

• Rate of change
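The three build steps on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the authors' implementation: the function names, the whitespace tokenizer, the `(doc_id, position)` location format, and the use of a dict in place of on-disk inverted lists are all assumptions.

```python
from collections import defaultdict


def extract_postings(doc_id, text):
    """Step 1: turn one page into (term, location) postings.

    Assumption: a location is (doc_id, word position) and terms are
    lowercased whitespace-separated tokens.
    """
    return [(term.lower(), (doc_id, pos)) for pos, term in enumerate(text.split())]


def build_inverted_index(pages):
    """Build an inverted index from a {doc_id: text} mapping."""
    postings = []
    for doc_id, text in pages.items():
        postings.extend(extract_postings(doc_id, text))

    # Step 2: sort first on index term, then on location.
    postings.sort()

    # Step 3: group the sorted postings into one inverted list per term.
    # A real indexer would write these lists to disk; a dict stands in here.
    index = defaultdict(list)
    for term, location in postings:
        index[term].append(location)
    return dict(index)


pages = {1: "web index build", 2: "index the web"}
index = build_inverted_index(pages)
```

Because the postings are sorted before grouping, each inverted list comes out already ordered by location.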

Page 5

Purpose of The Paper?

• To optimize build times for massive (Web-scale) collections (challenges and solutions).

– Propose a pipelined architecture on each indexing node to enhance performance through intra-node parallelism (addresses build performance).

– Propose an appropriate format for inverted files that makes optimal use of the features of an embedded database system.

– Any distributed system for building inverted indexes needs to address the issue of collecting global statistics (e.g., inverse document frequency, IDF). The paper examines different strategies for collecting such statistics from a distributed collection.

Page 6

TESTBED ARCHITECTURE

• Distributors. These nodes store the collection of Web pages to be indexed. Pages are gathered by a Web crawler.

• Indexers. These nodes execute the core of the index building engine.

• Query servers. Each of these nodes stores a portion of the final inverted index and an associated lexicon. The lexicon lists all the terms in the corresponding portion of the index and their associated statistics.

Overview of indexing process.

Page 7

PIPELINED INDEXER DESIGN

Logical phases

• The core of the indexing system is the index-builder process that executes on each indexer.

Page 8

PIPELINED INDEXER DESIGN

Multi-threaded execution

Performance gain through pipelining: saves 1.5 hours for 5 million pages; 30-40% in general.
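A rough sketch of multi-threaded, pipelined execution follows. It assumes the index-builder's three phases are loading (parse pages into postings), processing (sort the buffer), and flushing (write out a sorted run). The phase names are not spelled out on these slides, and `run_pipeline`, the queue sizes, and the list used in place of disk are hypothetical.

```python
import queue
import threading


def run_pipeline(batches):
    """Run load, process, and flush concurrently, one thread per phase.

    `batches` is a list of page batches; each batch is a list of
    (doc_id, text) pairs that fits in one memory buffer (an assumption).
    """
    loaded = queue.Queue(maxsize=2)       # buffers waiting to be sorted
    sorted_runs = queue.Queue(maxsize=2)  # sorted runs waiting to be flushed
    flushed = []                          # stand-in for sorted runs on disk
    done = object()                       # sentinel marking end of input

    def load():
        for batch in batches:
            postings = [(term, doc_id, pos)
                        for doc_id, text in batch
                        for pos, term in enumerate(text.split())]
            loaded.put(postings)
        loaded.put(done)

    def process():
        while (buf := loaded.get()) is not done:
            sorted_runs.put(sorted(buf))
        sorted_runs.put(done)

    def flush():
        while (run := sorted_runs.get()) is not done:
            flushed.append(run)

    threads = [threading.Thread(target=phase) for phase in (load, process, flush)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return flushed


runs = run_pipeline([[(1, "b a")], [(2, "c a")]])
```

While one buffer is being sorted, the next can be loaded and the previous one flushed. The bounded queues keep a fast phase from running arbitrarily far ahead of a slow one.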

Page 9

MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM

Challenge 1:

Custom implementation vs. existing data management systems

Solution: Berkeley DB

Challenge 2:

Designing a scheme for storing inverted files that makes optimal use of the storage structures provided by the data management system.

Full list, Single payload, Mixed list:

Page 10

Three storage schemes:

– 1. Full list: The key is an index term, and the value is the complete inverted list for that term.

– 2. Single payload: Each posting (an index term, location pair) is a separate key.

Page 11

3. Mixed list:
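The schemes can be illustrated with a plain dict standing in for a key-value store such as Berkeley DB. The full-list and single-payload layouts follow the definitions on the slides. The mixed-list layout is an assumption, inferred from the comparison slide's remark that the value field has roughly constant length: each key is a posting and each value holds a bounded run of the postings that follow it. The function names, the empty single-payload value, and the `chunk` parameter are hypothetical.

```python
# Postings as (index term, location) pairs, already sorted.
postings = [("cat", 100), ("cat", 105), ("dog", 98), ("dog", 110), ("dog", 130)]


def full_list(postings):
    """Key = index term, value = the complete inverted list for that term."""
    store = {}
    for term, location in postings:
        store.setdefault(term, []).append(location)
    return store


def single_payload(postings):
    """Each posting is a separate key; the value is left empty in this sketch."""
    return {(term, location): b"" for term, location in postings}


def mixed_list(postings, chunk=2):
    """ASSUMED layout: key = first posting of a run, value = the rest of the run.

    Runs have at most `chunk` postings, so value length stays bounded.
    """
    store = {}
    for start in range(0, len(postings), chunk):
        run = postings[start:start + chunk]
        store[run[0]] = run[1:]
    return store
```

With the sample data, `full_list` produces two keys, `single_payload` five, and `mixed_list` three.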

Page 12

Comparison of storage schemes

Index size -- With the mixed list scheme, the length of the value field is approximately constant.

Zig-zag joins -- In the full list scheme, the entire list must be retrieved to compute the join, whereas with the mixed list scheme, access to specific portions of the inverted list is available.

Hot updates -- Since we limit the length of the value field, hot updates are faster with mixed lists than with full lists.
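To make the zig-zag join point concrete, here is a minimal sketch of such a join over two sorted lists of document IDs. It is not the paper's code: `zigzag_join` is a hypothetical name, and binary search over an in-memory list stands in for a keyed seek into the stored index.

```python
from bisect import bisect_left


def zigzag_join(a, b):
    """Intersect two sorted lists, alternately seeking ahead in each."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in `a` to the first entry >= b[j], skipping the gap.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in `b` to the first entry >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return out


common = zigzag_join([1, 4, 9, 20, 31], [4, 5, 6, 7, 20, 40])
```

Each seek corresponds to jumping to a specific portion of an inverted list. With full lists the whole value must be fetched before any seeking can happen, which is the cost the comparison above refers to.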

Page 13

Experimental Results

2 million Web pages, 4.9 million distinct terms, 312 million postings

Optimal mixed list 30% better than full list

Page 14

COLLECTING GLOBAL STATISTICS

• ME Strategy (sending local information during merging).

• FL Strategy (sending local information during flushing).
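Both strategies end with per-indexer counts being combined into global statistics. The sketch below shows only that combining step and does not model when the counts are sent (during merging for ME, during flushing for FL). The function names are hypothetical, and `log(N / df)` is one common IDF definition, assumed here rather than taken from the paper.

```python
import math
from collections import Counter


def merge_local_stats(local_stats):
    """Sum per-indexer document frequencies into global document frequencies."""
    total = Counter()
    for counts in local_stats:
        total.update(counts)
    return total


def idf(term, global_df, num_docs):
    """Inverse document frequency, using the assumed form log(N / df)."""
    return math.log(num_docs / global_df[term])


# Two indexers report how many of their local documents contain each term.
indexer_a = Counter({"web": 3, "index": 1})
indexer_b = Counter({"web": 2})
global_df = merge_local_stats([indexer_a, indexer_b])
web_idf = idf("web", global_df, num_docs=10)
```

No single indexer can compute `web_idf` alone, because the document frequency of "web" is split across nodes.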

Page 15

Experiments

• In general, experiments show the FL strategy outperforming ME, although they seem to converge as the collection size becomes large. Furthermore, as the collection size grows, the relative overheads of both strategies decrease.

Page 16

CONCLUSIONS

• In this paper we addressed the problem of efficiently constructing inverted indexes over large collections of Web pages.

• We proposed a new pipelining technique to speed up index construction and demonstrated how to identify the right buffer sizes for maximum performance.

• We proposed and compared different schemes for storing and managing inverted files using an embedded database system.

• Finally, we identified the key characteristics of methods for efficiently collecting global statistics from distributed inverted indexes.

Page 17

Q & A