Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener 201195001 Doğuş University

Dec 31, 2015

Transcript
Page 1: Fast Phrase Querying With Combined Indexes

Fast Phrase Querying With Combined Indexes

HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE

RMIT University

2004

Burak Görener 201195001

Doğuş University

Page 2: Fast Phrase Querying With Combined Indexes

Search Engines . . .

Need to evaluate queries extremely fast.

Involve phrases.

Supported with low disk overheads.

Page 3: Fast Phrase Querying With Combined Indexes

Introduction

Most queries consist of a simple list of words.

Some query terms must be ordered and adjacent (a phrase).

This is typically indicated by enclosing the terms in quotation marks. The standard way to evaluate phrase queries is to use an inverted index.

An inverted index (II) uses:

A list of postings (each posting includes a document ID).

A list of offsets (ordinal word positions).

The II works by combining the postings lists of the query terms to find the documents in which the terms occur as a phrase.

This process is fast for most phrases, but not for all of them, because of common words.

Page 4: Fast Phrase Querying With Combined Indexes

Introduction Cont.

The postings list of a common term can require several megabytes for each GB of indexed data.

A crude solution is to use stopping (ignoring common words).

Google neglected common words in phrase queries until 2002.

Until then, many queries were evaluated incorrectly.

Page 5: Fast Phrase Querying With Combined Indexes

Introduction Cont.

A nextword index is like an inverted index.

A nextword index uses word pairs as index terms (a firstword and a nextword).

For each index term (firstword), the nextword index holds a list of the words (nextwords) that follow that term. A firstword and a nextword occur as a pair.

Its disadvantage is its storage size. The nextword lists must also be processed linearly.

With direct phrase indexing, indexing the 10,000 most common phrase queries reduces query evaluation time by over 10%.

Page 6: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries

Inverted Index in Phrase Queries

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Page 7: Fast Phrase Querying With Combined Indexes

Properties of Queries

This research used query logs from Excite from 1997 and 1999.

These logs have similar properties. They contain 1,583,922 queries, including duplicates. 8.3% of these were explicit phrase queries. Overall, 5 to 10% of queries are explicit phrase queries.

Queries were matched against a Web dataset of around 20 GB.

Of the phrase queries, 11,103 or 8.4% include one of the three common words the, to, and of.

In total, 14.4% of phrase queries include one of the 20 commonest terms.

Page 9: Fast Phrase Querying With Combined Indexes

Properties of Queries

Common words play an important role!

In tower of london, the common word can be safely neglected during evaluation.

But this is not so in a special name, such as a movie name or brand name: End of days or The who.

These queries are difficult to evaluate with stopwords removed.

The query logs also include: To be or not to be; Who are we; All in all.

Page 10: Fast Phrase Querying With Combined Indexes

Properties of Queries

Stopping may yield an efficiency gain,

but a significant number of queries cannot be correctly evaluated.

The basic query tower of london is evaluated as tower – london.

Stopping the 3 commonest words: result 309 x 10^6 matches.

Stopping the 20 commonest words: result 490 x 10^6 matches.

Stopping the 254 commonest words: result 1693 x 10^6 matches.

The most common confusion is between from and to.

Mismatch: flights from london and flights to london.
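
The mismatch above can be illustrated with a minimal sketch. The stop list here is hypothetical (the slides do not give the actual lists), and the sketch simply drops stopped words rather than keeping a gap where they were.

```python
# Hypothetical stop list, chosen only for illustration.
STOPWORDS = {"the", "to", "of", "from", "in", "on"}

def stop(query):
    """Drop stopwords from a query, keeping the remaining words in order."""
    return [w for w in query.lower().split() if w not in STOPWORDS]

a = stop("flights from london")
b = stop("flights to london")
print(a, b, a == b)  # both queries reduce to the same word list
```

Once the distinguishing word is stopped, the two queries cannot be told apart, so either one returns matches for the other.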

Page 11: Fast Phrase Querying With Combined Indexes

Properties of Queries

Other mismatch examples:

So many roads -> how many roads

Man in the moon -> man on the moon

Length of the phrase queries:

Generally 2 words. 34% have 3 words. 1.3% have 6 or more words.

Page 12: Fast Phrase Querying With Combined Indexes

Properties of Queries

Test Data

The test data is the WT10g collection. It is 10.27 GB of Web data (HTML) in 1.67 million documents. It was crawled in 1997.

Page 13: Fast Phrase Querying With Combined Indexes

Most Frequent Words and Word Pairs

Page 14: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Page 15: Fast Phrase Querying With Combined Indexes

Inverted Index

It is a standard method for supporting queries on large text databases.

It is fast for ranked query evaluation.

It uses a two-level structure:

The upper level is a vocabulary or lexicon. The lower level is a set of postings lists.

Zobel and Moffat (1998) notation:

d is the document ID. f_d,t is the frequency of term t in document d. o_x is a position of the term in document d.
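
A minimal sketch of this two-level structure, assuming whitespace tokenisation and two toy documents invented for illustration. The function name `build_inverted_index` is hypothetical; a dictionary plays the role of the vocabulary and each value is a postings list of (d, f_d,t, offsets) entries.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term t to a postings list of (d, f_dt, [o_1, ..., o_f]) tuples."""
    positions = defaultdict(lambda: defaultdict(list))
    for d, text in docs.items():
        # Offsets are ordinal word positions, counted from 1.
        for o, term in enumerate(text.lower().split(), start=1):
            positions[term][d].append(o)
    return {
        term: [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }

# Toy documents (assumptions, not from the WT10g collection).
docs = {1: "hatful of hollow", 2: "a hatful of rain of course"}
index = build_inverted_index(docs)
print(index["of"])  # [(1, 1, [2]), (2, 2, [3, 5])]
```

The document frequency of a term is the length of its postings list, so it does not need to be stored separately in this sketch.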

Page 16: Fast Phrase Querying With Combined Indexes

Inverted Index

Let's look at "hatful of hollow".

This is the general structure of an inverted index.

It contains the term and document frequencies.

Word positions are ordinal.
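
As a hedged illustration of how such an index answers "hatful of hollow", the sketch below keeps a start offset only when every later query word appears at the expected distance. The postings are toy values, and `phrase_query` is a hypothetical name. The terms are processed in query order for simplicity; an engine may prefer to start from the rarest term.

```python
def phrase_query(index, phrase):
    """Return {d: [start offsets]} for documents that contain the phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Start from the first term's offsets and keep only those that are
    # followed, at distance k, by the k-th later term.
    result = {d: list(offs) for d, offs in index[terms[0]].items()}
    for k, term in enumerate(terms[1:], start=1):
        postings = index[term]
        merged = {}
        for d, starts in result.items():
            later = set(postings.get(d, ()))
            kept = [s for s in starts if s + k in later]
            if kept:
                merged[d] = kept
        result = merged
    return result

# Toy postings: term -> {document ID: ordinal word positions}.
index = {
    "hatful": {1: [1], 2: [2]},
    "of": {1: [2], 2: [3, 5]},
    "hollow": {1: [3]},
}
print(phrase_query(index, "hatful of hollow"))  # {1: [1]}
```

The long list for a common word such as of must be read in full even though few of its offsets survive the merge, which is the cost the later slides try to avoid.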

Page 17: Fast Phrase Querying With Combined Indexes

Inverted Index

Inverted Index Evaluator

It is the open source MG text retrieval engine, described by Witten et al. (1999).

Inverted index data size for WT10g is 1,429 MB.

Stopped word data size is 427 MB (490 stopwords). Stopped inverted index size is 1,002 MB.

Page 18: Fast Phrase Querying With Combined Indexes

Inverted Index

Results of inverted index performance

Page 19: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Page 20: Fast Phrase Querying With Combined Indexes

Phrase Indexes

A phrase index is an inverted index in which the stored items are word sequences.

A partial phrase index with a vocabulary of five popular phrases.

Page 21: Fast Phrase Querying With Combined Indexes

Phrase Indexes

A phrase index with L = 3 cannot be used efficiently for 2-word queries.

Sequences with L >= 2 are stored as terms, as in a conventional inverted index. The case L = 2 is organized as a partial nextword index.

Partial Phrase Index

Its notation is:

d is the document ID. f_d,p is the frequency of phrase p in document d. Offsets are not stored. Storing whole phrases saves the cost of merging lists.
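
A minimal sketch of a partial phrase index under these definitions, assuming whitespace tokenisation and toy documents. The name `build_phrase_index` is hypothetical. Each chosen phrase maps to (d, f_d,p) pairs only, so a query that matches a stored phrase needs a single lookup and no merging.

```python
from collections import Counter

def build_phrase_index(docs, phrases):
    """Map each chosen phrase p to a list of (d, f_dp) pairs, with no offsets."""
    index = {}
    for p in phrases:
        words = p.lower().split()
        n = len(words)
        counts = Counter()
        for d, text in docs.items():
            tokens = text.lower().split()
            # Count every position at which the whole word sequence occurs.
            hits = sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
            if hits:
                counts[d] = hits
        index[" ".join(words)] = sorted(counts.items())
    return index

# Toy documents (assumptions for illustration).
docs = {
    1: "lord of the rings and more lord of the rings",
    2: "the rings of saturn",
}
phrase_index = build_phrase_index(docs, ["lord of the rings"])
print(phrase_index)  # {'lord of the rings': [(1, 2)]}
```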

Page 22: Fast Phrase Querying With Combined Indexes

Phrase Indexes

Examples are:

lord of the rings (19) and britney spears (59)* in 2001.

Given a stream of queries over a long period and a fixed volume of memory,

it may also be required to update the vocabulary or replace the least frequently used queries.

This research does not experiment with this approach.

* The number of times the same query was requested.

Page 23: Fast Phrase Querying With Combined Indexes

Nextword Indexes

A phrase query can never be shorter than two words.

A nextword index is similar to an inverted index.

Term representation:

f_wp is the document frequency of the word pair. d is the document ID. f_d,wp is the frequency of the pair in document d. o_x is a position of the pair in d.
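
A minimal sketch of this representation, assuming whitespace tokenisation and toy documents. The names `build_nextword_index` and `pair_postings` are hypothetical. Each firstword maps to its nextwords, and each pair carries its own positional postings.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """firstword -> nextword -> {d: [offsets of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for d, text in docs.items():
        tokens = text.lower().split()
        # Pair every word with the word that follows it.
        for o, (first, nxt) in enumerate(zip(tokens, tokens[1:]), start=1):
            index[first][nxt].setdefault(d, []).append(o)
    return index

def pair_postings(index, first, nxt):
    """Return (document frequency, [(d, f_d, offsets), ...]) for one word pair."""
    by_doc = index.get(first, {}).get(nxt, {})
    postings = [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
    return len(postings), postings

# Toy documents (assumptions for illustration).
docs = {1: "in new york in new hampshire", 2: "railroads in new hampshire"}
nextword_index = build_nextword_index(docs)
print(pair_postings(nextword_index, "in", "new"))
```

Because postings are stored per pair, the list for in new is much shorter than the inverted list for in alone, at the price of storing every distinct pair.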

Page 24: Fast Phrase Querying With Combined Indexes

Nextword Indexes

A nextword index with two firstwords.

An example: boulder municipal employee credit union

This can be grouped as boulder-municipal, employee-credit, and credit-union.

Another example: historical railroads in new hampshire

Here the pair railroads in is chosen in preference to in new, as railroads is much less common than in.
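
One greedy way to arrive at such groupings is sketched below: take firstwords from rarest to commonest and keep a pair only if it covers a word that is not yet covered. This is an illustrative assumption rather than the authors' stated procedure, and the frequencies are made up.

```python
def choose_pairs(query, frequency):
    """Greedily pick firstword-nextword pairs, rarest firstword first,
    until every query word is covered by at least one pair."""
    words = query.lower().split()
    if len(words) < 2:
        return []
    # The last word has no nextword, so it cannot be a firstword.
    # Words missing from the frequency table are treated as rarest.
    candidates = sorted(range(len(words) - 1),
                        key=lambda i: frequency.get(words[i], 0))
    covered = set()
    chosen = []
    for i in candidates:
        if i not in covered or i + 1 not in covered:
            chosen.append(i)
            covered.update((i, i + 1))
    return [(words[i], words[i + 1]) for i in sorted(chosen)]

# Hypothetical term frequencies, for illustration only.
frequency = {
    "boulder": 10, "municipal": 50, "employee": 20, "credit": 60, "union": 40,
    "historical": 30, "railroads": 5, "in": 1000, "new": 300, "hampshire": 20,
}
print(choose_pairs("boulder municipal employee credit union", frequency))
print(choose_pairs("historical railroads in new hampshire", frequency))
```

With these frequencies the second query is covered by historical-railroads, railroads-in, and new-hampshire, so the expensive firstword in is never looked up.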

Page 25: Fast Phrase Querying With Combined Indexes

Nextword Indexes

The nextword index for the WT10g collection is 2.75 GB in size.

This is roughly twice the size of the inverted index file (1,429 MB).

Processing with the nextword index involves more complex structures than processing with the inverted index.

Differences between the inverted index and the nextword index in queries

Page 26: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing

Experimental Result

Conclusion

Page 27: Fast Phrase Querying With Combined Indexes

Combining Nextword and Inverted Indexing

The proposal is that only common words be used as firstwords in a partial nextword index.
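
A hedged sketch of how a query might be split between the two indexes under this proposal. The function `plan_lookups` and the set of common words are hypothetical. Each common word that has a following word becomes a nextword lookup, and any word left uncovered falls back to the inverted index.

```python
def plan_lookups(query, common_words):
    """Return a list of (index kind, word position, words) lookup steps."""
    words = query.lower().split()
    plan = []
    covered = set()
    for i, w in enumerate(words):
        # A common word can be a firstword only if another word follows it.
        if w in common_words and i + 1 < len(words):
            plan.append(("nextword", i, (w, words[i + 1])))
            covered.update((i, i + 1))
    for i, w in enumerate(words):
        if i not in covered:
            plan.append(("inverted", i, (w,)))
    return sorted(plan, key=lambda step: step[1])

# Hypothetical set of common words: the three commonest named in the slides.
common_words = {"the", "to", "of"}
print(plan_lookups("tower of london", common_words))
print(plan_lookups("the who", common_words))
```

In this sketch the long inverted lists of the common words are never read unless a common word is the last word of the query.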

Page 28: Fast Phrase Querying With Combined Indexes

Combining Phrase and Inverted Indexing

As an example, the query new york city

can be resolved by using the partial phrase index to find the locations of new york and merging them with the

inverted index postings list for city.

Page 29: Fast Phrase Querying With Combined Indexes

Three-Way Index Combination

It includes a partial nextword index, a partial phrase index, and a full inverted index.
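
A minimal sketch of how a query could be routed among the three indexes. The order of precedence (whole-query phrase match first, then common words, then the plain inverted index) is an assumption made for illustration, and `choose_strategy` is a hypothetical name.

```python
def choose_strategy(query, phrase_index, common_words):
    """Pick which index combination would answer the query in this sketch."""
    words = query.lower().split()
    # 1. The whole query is one of the stored popular phrases.
    if " ".join(words) in phrase_index:
        return "phrase index"
    # 2. A common word (not in last position) can act as a firstword.
    if any(w in common_words for w in words[:-1]):
        return "nextword + inverted index"
    # 3. Otherwise, merge positional postings from the inverted index.
    return "inverted index"

# Hypothetical index contents, for illustration only.
phrase_index = {"los angeles department of water and power": [(7, 1)]}
common_words = {"the", "to", "of"}
for q in ("Los Angeles Department of Water and Power",
          "tower of london",
          "britney spears"):
    print(q, "->", choose_strategy(q, phrase_index, common_words))
```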

Page 30: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing (Fin)

Experimental Result

Conclusion

Page 31: Fast Phrase Querying With Combined Indexes

Experimental Result

All experiments were run on an Intel 700 MHz Pentium III based server with 2 GB of memory.

Result of Inverted and Nextword Indexing

This table includes the memory usage of the combinations.

Page 32: Fast Phrase Querying With Combined Indexes

Result of Inverted and Nextword Indexing

Results for n-term queries with inverted and nextword indexing

Page 33: Fast Phrase Querying With Combined Indexes

Result of Inverted Index and Phrase

This test evaluates the 100, 1,000, and 10,000 most frequent distinct queries.

The phrase index was less than 0.1% of the collection: 2.1 MB, 4.8 MB, and 12.8 MB.

In the query logs, an american dictionary of the english language and los angeles department of

water and power are among the 10,000 most common queries.

Experimental results:

Page 34: Fast Phrase Querying With Combined Indexes

Result of Inverted Index, Nextword Index and Phrase

This result is based on testing 66,000 queries, using a phrase index of the 10,000 most common queries, a nextword index (stopped words only), and inverted indexing.

Page 35: Fast Phrase Querying With Combined Indexes

Next . . .

Introduction (Fin)

Properties of Phrase Queries (Fin)

Inverted Index in Phrase Queries (Fin)

Partial Phrase and Nextword Indexing (Fin)

Combining Phrase and Inverted Indexing (Fin)

Experimental Result (Fin)

Conclusion

Page 36: Fast Phrase Querying With Combined Indexes

Conclusion