Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ Prof. Raymond J. Mooney in CS378 at University of Texas Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, 1983, McGraw-Hill.

Jan 02, 2016


Information Retrieval

CSE 8337

Spring 2007

Query Languages & Matching

Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/

Prof. Raymond J. Mooney in CS378 at University of Texas

Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, 1983, McGraw-Hill.


Query Languages TOC

  Keyword Based
  Boolean
  Weighted Boolean
  Context Based (Phrasal & Proximity)
  Pattern Matching
  Structural Queries


Keyword Based Queries

Basic Queries
  Single word
  Multiple words

Context Queries
  Phrase
  Proximity


Boolean Queries

Keywords combined with Boolean operators:
  OR: (e1 OR e2)
  AND: (e1 AND e2)
  BUT: (e1 BUT e2) Satisfy e1 but not e2

Negation is only allowed through BUT, so that the inverted index can be used efficiently: the negated term merely filters another set that is itself efficiently retrievable.

Naïve users have trouble with Boolean logic.


Boolean Retrieval with Inverted Indices

Primitive keyword: Retrieve containing documents using the inverted index.

OR: Recursively retrieve e1 and e2 and take union of results.

AND: Recursively retrieve e1 and e2 and take intersection of results.

BUT: Recursively retrieve e1 and e2 and take set difference of results.
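
The recursive evaluation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-tuple query form, the function names `build_index` and `evaluate`, and the three toy documents are invented here and do not come from the slides.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def evaluate(query, index):
    """Evaluate a keyword (str) or a nested tuple (op, e1, e2) recursively."""
    if isinstance(query, str):
        # Primitive keyword: look up its posting set in the inverted index.
        return set(index.get(query, set()))
    op, e1, e2 = query
    left, right = evaluate(e1, index), evaluate(e2, index)
    if op == "OR":
        return left | right      # union
    if op == "AND":
        return left & right      # intersection
    if op == "BUT":
        return left - right      # set difference
    raise ValueError(f"unknown operator: {op}")

# Toy collection (hypothetical).
docs = {1: "information retrieval systems",
        2: "database systems",
        3: "information theory"}
index = build_index(docs)
result = evaluate(("BUT", ("OR", "information", "database"), "theory"), index)
```

Here `result` holds the documents that mention "information" or "database" but not "theory". BUT never has to enumerate the complement of a set, which is the efficiency argument made on the previous slide.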


Weighted Boolean Queries

Extension to Boolean queries which adds weights to terms.

Weights indicate the degree of importance that each term has in the operation.

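Traditional Boolean Interpretation:

[Formula lost in extraction. In the examples on the later slides an unweighted operand behaves as if it had weight 1, so A OR B is read as A1 OR B1.]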


Processing Weighted Boolean Queries

If both query and document terms are weighted, then retrieve a document if the document term weight is greater than the query term weight.

If documents are not weighted, rank them based on some similarity measure.

The weights in the query determine which documents, and how many, are included.

The next few slides show how this can be done in a system with no document term weights.


Weighted Boolean Interpretation

[Figure: the sets A0, A0.33, A0.5, A0.66, A1]

To perform Boolean operations, must be able to determine distance from items in A (B) to B (A).
  Centroid-based distances


Weighted OR

Aa OR Bb

As b gets closer to 1, include more items in B which are closest to A.

As a gets closer to 1, include more items in A which are closest to B.

A OR B0.33 – All items in A plus 1/3 of those in B
  [Figure: sets A and B]

A0.66 OR B0.33 – 2/3 of the items in A and 1/3 of those in B
  [Figure: sets A and B]
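
One possible reading of weighted OR with centroid-based distances is sketched below. Everything specific in it is an assumption made for illustration: documents are represented as numeric vectors, "closest to B" is taken to mean closest to the centroid of B, fractions are rounded to a whole number of items, and the names `centroid`, `closest_fraction`, and `weighted_or` are hypothetical. The slides do not say how items shared by A and B are counted, so the example uses disjoint sets.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    points = list(points)
    return tuple(sum(c) / len(points) for c in zip(*points))

def closest_fraction(items, target, fraction, vectors):
    """Return the given fraction of items (rounded to a count) nearest to target."""
    ranked = sorted(items, key=lambda d: math.dist(vectors[d], target))
    return set(ranked[:round(fraction * len(ranked))])

def weighted_or(A, B, a, b, vectors):
    """Aa OR Bb: fraction a of A nearest B's centroid plus fraction b of B nearest A's."""
    centre_a = centroid(vectors[d] for d in A)
    centre_b = centroid(vectors[d] for d in B)
    return (closest_fraction(A, centre_b, a, vectors)
            | closest_fraction(B, centre_a, b, vectors))

# Toy one-dimensional document vectors (hypothetical).
vectors = {1: (0.0,), 2: (1.0,), 3: (2.0,), 4: (4.0,), 5: (5.0,), 6: (6.0,)}
A, B = {1, 2, 3}, {4, 5, 6}

all_a_third_b = weighted_or(A, B, 1.0, 0.33, vectors)      # A OR B0.33
two_thirds_a = weighted_or(A, B, 0.66, 0.33, vectors)      # A0.66 OR B0.33
```

With these toy vectors the first query keeps all of A and the single item of B nearest to A, and the second additionally drops the item of A farthest from B.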


Weighted AND

Aa AND Bb

As b gets closer to 1, items in A-B farthest from intersection are removed.

As a gets closer to 1, items in B-A farthest from intersection are removed.

A AND B0.33 – Remove all of B and 1/3 of A outside of intersection.
  [Figure: sets A and B]

A0.66 AND B0.33 – Remove 1/3 of A and 2/3 of B.
  [Figure: sets A and B]
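
The two examples above suggest the following rule, sketched here under explicit assumptions: the intersection is always kept, a fraction 1 - b of A-B and a fraction 1 - a of B-A survive, and "farthest from intersection" is measured against the centroid of the intersection. The vector representation, the rounding, and the names `nearest_fraction` and `weighted_and` are all hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    points = list(points)
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest_fraction(items, target, fraction, vectors):
    """Keep the given fraction of items (rounded to a count) nearest to target."""
    ranked = sorted(items, key=lambda d: math.dist(vectors[d], target))
    return set(ranked[:round(fraction * len(ranked))])

def weighted_and(A, B, a, b, vectors):
    """Aa AND Bb: the intersection plus the parts of A-B and B-A that survive removal."""
    both = A & B
    centre = centroid(vectors[d] for d in both)  # assumes a non-empty intersection
    keep_a = nearest_fraction(A - B, centre, 1 - b, vectors)  # b near 1 removes more of A-B
    keep_b = nearest_fraction(B - A, centre, 1 - a, vectors)  # a near 1 removes more of B-A
    return both | keep_a | keep_b

# Toy one-dimensional document vectors (hypothetical): document i sits at position i.
vectors = {i: (float(i),) for i in range(1, 8)}
A, B = {1, 2, 3, 4}, {4, 5, 6, 7}

strict_a = weighted_and(A, B, 1.0, 0.33, vectors)    # A AND B0.33
relaxed = weighted_and(A, B, 0.66, 0.33, vectors)    # A0.66 AND B0.33
```

In the first call all of B outside the intersection and one third of A-B are removed; in the second, one third of A-B and two thirds of B-A are removed, matching the two slide examples.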


Weighted NOT

Aa NOT Bb

As b gets closer to 1, remove more items from intersection that are farthest from A-B.

a indicates how much of A to include.

A NOT B0.33 – All items in A-B and 2/3 of items in intersection.
  [Figure: sets A and B]

A0.66 NOT B0.33 – 2/3 of the items in A-B and 2/3 of items in intersection.
  [Figure: sets A and B]


Phrasal Queries

Retrieve documents with a specific phrase (ordered list of contiguous words):
  “information theory”

May allow intervening stop words and/or stemming.
  “buy camera” matches: “buy a camera”, “buying the cameras”, etc.


Phrasal Retrieval with Inverted Indices

Must have an inverted index that also stores positions of each keyword in a document.

Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions.

Best to start contiguity check with the least common word in the phrase.


Phrasal Search Algorithm

1. Find set of documents D in which all keywords (k1…km) in phrase occur (using AND query processing).
2. Initialize empty set, R, of retrieved documents.
3. For each document, d, in D do
4.   Get array, Pi, of positions of occurrences for each ki in d
5.   Find shortest array Ps of the Pi’s
6.   For each position p of keyword ks in Ps do
7.     For each keyword ki except ks do
8.       Use binary search to find a position (p – s + i) in the array Pi
9.     If correct position for every keyword found, add d to R
10. Return R
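
A runnable sketch of the algorithm follows. The index layout (word to document id to sorted position list), the whitespace tokenizer, and the names `build_positional_index`, `contains`, and `phrase_search` are assumptions made for illustration; keyword indices are 0-based here, which leaves the offset p - s + i unchanged.

```python
from bisect import bisect_left

def build_positional_index(docs):
    """Map word -> {doc_id: sorted list of word positions}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def contains(sorted_positions, target):
    """Binary search for target in a sorted list of positions."""
    i = bisect_left(sorted_positions, target)
    return i < len(sorted_positions) and sorted_positions[i] == target

def phrase_search(phrase, index):
    """Return ids of documents in which the words of phrase occur contiguously, in order."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Step 1: documents containing every keyword (AND processing).
    candidates = set.intersection(*(set(index[w]) for w in words))
    retrieved = set()                                   # step 2
    for d in candidates:                                # step 3
        arrays = [index[w][d] for w in words]           # step 4
        # Step 5: anchor on the keyword with the fewest occurrences in d.
        s = min(range(len(words)), key=lambda i: len(arrays[i]))
        for p in arrays[s]:                             # step 6
            # Steps 7-8: keyword i must occur at position p - s + i.
            if all(contains(arrays[i], p - s + i)
                   for i in range(len(words)) if i != s):
                retrieved.add(d)                        # step 9
                break
    return retrieved                                    # step 10

# Toy collection (hypothetical).
docs = {1: "information theory and coding",
        2: "theory of information",
        3: "the theory of information theory"}
hits = phrase_search("information theory", build_positional_index(docs))
```

Document 2 contains both words but not adjacently in the right order, so only documents 1 and 3 are returned.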


Proximity Queries

List of words with specific maximal distance constraints between terms.

Example: “dogs” and “race” within 4 words match “…dogs will begin the race…”

May also perform stemming and/or not count stop words.


Proximity Retrieval with Inverted Index

Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints.

During binary search for positions of remaining keywords, find closest position of ki to p and check that it is within maximum allowed distance.
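
The closest-position check can be sketched as follows for a single document. The assumptions are labeled here and in the comments: distance is the difference in word positions, order does not matter, every other keyword is measured against one anchor keyword, and the names `closest_distance` and `within` are hypothetical.

```python
from bisect import bisect_left

def closest_distance(sorted_positions, p):
    """Distance from p to the nearest entry of a non-empty sorted list (binary search)."""
    i = bisect_left(sorted_positions, p)
    # Only the entries just before and at the insertion point can be nearest.
    neighbours = sorted_positions[max(i - 1, 0):i + 1]
    return min(abs(q - p) for q in neighbours)

def within(text, words, max_dist):
    """True if some occurrence of the rarest word has every other word within max_dist."""
    tokens = text.lower().split()
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in words}
    if any(not ps for ps in positions.values()):
        return False
    # Assumption: anchor on the least frequent keyword, as in phrasal search.
    anchor = min(words, key=lambda w: len(positions[w]))
    return any(
        all(closest_distance(positions[w], p) <= max_dist
            for w in words if w != anchor)
        for p in positions[anchor]
    )

match_4 = within("dogs will begin the race", ["dogs", "race"], 4)
match_3 = within("dogs will begin the race", ["dogs", "race"], 3)
```

In the slide's example "dogs" and "race" are four positions apart, so the query succeeds with a bound of 4 and fails with a bound of 3.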


Pattern Matching

Allow queries that match strings rather than word tokens.

Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.


Simple Patterns

Prefixes: Pattern that matches start of word.
  “anti” matches “antiquity”, “antibody”, etc.

Suffixes: Pattern that matches end of word.
  “ix” matches “fix”, “matrix”, etc.

Substrings: Pattern that matches an arbitrary contiguous sequence of characters within a word.
  “rapt” matches “enrapture”, “velociraptor”, etc.

Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them.
  “tin” to “tix” matches “tip”, “tire”, “title”, etc.
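
The four pattern types can be tried against a small sorted vocabulary. This is a sketch, not an efficient index: prefixes and ranges use binary search on the sorted list, while suffixes and substrings fall back to a linear scan. The vocabulary, the function names, the inclusive range bounds, and the `"\uffff"` sentinel used as an upper bound for prefixes are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Small sorted vocabulary built from the slide's example words.
vocabulary = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                     "tin", "tip", "tire", "title", "tix", "velociraptor"])

def prefix_matches(prefix):
    """Words starting with prefix, found by two binary searches."""
    lo = bisect_left(vocabulary, prefix)
    hi = bisect_left(vocabulary, prefix + "\uffff")  # sentinel sorts after ordinary letters
    return vocabulary[lo:hi]

def range_matches(low, high):
    """Words lexicographically between low and high (bounds included, by assumption)."""
    return vocabulary[bisect_left(vocabulary, low):bisect_right(vocabulary, high)]

def suffix_matches(suffix):
    """Words ending with suffix (linear scan)."""
    return [w for w in vocabulary if w.endswith(suffix)]

def substring_matches(fragment):
    """Words containing fragment as a contiguous run of characters (linear scan)."""
    return [w for w in vocabulary if fragment in w]
```

For example, `prefix_matches("anti")` returns the two "anti" words and `range_matches("tin", "tix")` returns the five words from "tin" through "tix". The linear scans are what the more sophisticated data structures mentioned on the previous slide are meant to avoid.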


Allowing Errors

What if query or document contains typos or misspellings?

Judge similarity of words (or arbitrary strings) using:
  Edit distance (cost of insert/delete/match)
  Longest Common Subsequence (LCS)

Allow proximity search with bound on string similarity.
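
As an illustration of edit distance, here is a minimal dynamic-programming sketch. It assumes the common Levenshtein variant with unit costs for insertion, deletion, and substitution (a match costs nothing); the slide does not fix the cost model, and the name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t with unit costs (assumed cost model)."""
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # match or substitute
        prev = curr
    return prev[-1]

typo_cost = edit_distance("misspell", "mispell")
```

A single missing letter gives a distance of 1, so a query could accept any word within some small bound of the typed string.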


Longest Common Subsequence (LCS)

Length of the longest subsequence of characters shared by two strings.

A subsequence of a string is obtained by deleting zero or more characters.

Examples:
  “misspell” to “mispell” is 7
  “misspelled” to “misinterpretted” is 7 (“mis…p…e…ed”)
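
The LCS length can be computed with the standard dynamic program, sketched below. The function name `lcs_length` is hypothetical; only the length is computed, not the subsequence itself.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    # prev[j] is the LCS length of the processed prefix of s and t[:j].
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, 1):
            if cs == ct:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of s or of t
        prev = curr
    return prev[-1]

first = lcs_length("misspell", "mispell")
second = lcs_length("misspelled", "misinterpretted")
```

Both calls reproduce the slide's values: "mispell" is itself a subsequence of "misspell", and the second pair shares "mispeed" in order.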


Regular Expressions

Language for composing complex patterns from simpler ones.

An individual character is a regex.

Union: If e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches.

Concatenation: If e1 and e2 are regexes, then e1 e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2.

Repetition: If e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1.


Regular Expression Examples

(u|e)nabl(e|ing) matches unable unabling enable enabling

(un|en)*able matches able unable unenable enununenable
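
These two expressions use only union, concatenation, and repetition, so they can be checked directly with Python's `re` module. The word list and variable names are chosen here for illustration; `fullmatch` is used so that the whole word has to match.

```python
import re

first = re.compile(r"(u|e)nabl(e|ing)")
second = re.compile(r"(un|en)*able")

words = ["unable", "unabling", "enable", "enabling",
         "able", "unenable", "enununenable", "disable"]

# fullmatch requires the entire word to match, not just a substring of it.
matches_first = [w for w in words if first.fullmatch(w)]
matches_second = [w for w in words if second.fullmatch(w)]
```

`matches_first` contains the four words listed for the first expression, and `matches_second` the four listed for the second plus "enable", which `(un|en)*able` also accepts.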


Enhanced Regex’s (Perl)

Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”.

Special repetition operator (+) for 1 or more occurrences.

Special optional operator (?) for 0 or 1 occurrences.

Special repetition operator for specific range of number of occurrences: {min,max}.
  A{1,5} One to five A’s
  A{5,} Five or more A’s
  A{5} Exactly five A’s
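
Python's `re` module supports the same quantifiers, so their behavior can be tried out as below. The helper `full` and the sample strings are assumptions chosen for illustration.

```python
import re

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

one_or_more = [full(r"A+", s) for s in ["", "A", "AAA"]]                   # + : 1 or more
optional = [full(r"colou?r", s) for s in ["color", "colour", "colouur"]]  # ? : 0 or 1
bounded = [full(r"A{1,5}", "A" * n) for n in range(7)]                     # 1 to 5 A's
at_least = [full(r"A{5,}", "A" * n) for n in (4, 5, 6)]                    # 5 or more A's
exactly = [full(r"A{5}", "A" * n) for n in (4, 5, 6)]                      # exactly 5 A's
```

Each list records, for the sample strings in order, whether the pattern accepts the string; for instance `bounded` is true only for one through five A's.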


Perl Regex’s

Character classes:
  \w (word char) Any alphanumeric character or underscore (not: \W)
  \d (digit char) Any digit (not: \D)
  \s (space char) Any whitespace (not: \S)
  . (wildcard) Anything (except newline, by default)

Anchor points:
  \b (boundary) Word boundary
  ^ Beginning of string
  $ End of string
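
The same classes and anchors exist in Python's `re` module, which is used here for a quick illustration. The sample sentences and variable names are made up for the example.

```python
import re

text = "Room 42 opens at 9am"

digit_runs = re.findall(r"\d+", text)     # maximal runs of digits
word_runs = re.findall(r"\w+", text)      # maximal runs of word characters
space_count = len(re.findall(r"\s", text))

# \b keeps "at" from matching inside "cat" or "sat".
whole_word = re.findall(r"\bat\b", "at the cat sat at home")

starts_with_room = re.search(r"^Room", text) is not None  # ^ anchors at the beginning
ends_with_am = re.search(r"am$", text) is not None        # $ anchors at the end
```

`digit_runs` is `['42', '9']`, and `whole_word` contains only the two standalone occurrences of "at".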


Perl Regex Examples

U.S. phone number with optional area code: /\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/

Email address: /\b\S+@\S+(\.com|\.edu|\.gov|\.org|\.net)\b/

Note: Packages available to support Perl regex’s in Java
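
Both patterns also run unchanged under Python's `re` module, which makes them easy to try out. The sample strings, the phone number, and the e-mail address below are invented for illustration. One behavior worth noticing: the leading \b needs a word character on one side, so an area code in parentheses is only captured when it directly follows a word character.

```python
import re

phone = re.compile(r"\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b")
email = re.compile(r"\b\S+@\S+(\.com|\.edu|\.gov|\.org|\.net)\b")

# Made-up sample text; none of these contact details are real.
phone_hit = phone.search("Call 471-9558 today").group(0)
email_hit = email.search("Write to someone@cs.example.edu soon").group(0)

# After a space there is no word boundary in front of "(", so the
# optional area-code group is skipped and only the local number matches.
area_after_space = phone.search("Call (512) 471-9558").group(0)

# After a word character the boundary exists and the area code is included.
area_after_word = phone.search("Tel(512) 471-9558").group(0)
```

If the area code should be captured after whitespace as well, the leading \b would have to be replaced by a different condition; that change is not part of the original slide.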


Structural Queries

Assumes documents have structure that can be exploited in search.

Structure could be:
  Fixed set of fields, e.g. title, author, abstract, etc.
  Hierarchical (recursive) tree structure:

[Figure: tree rooted at book; a book contains chapters; a chapter contains a title and sections; a section contains a title and subsections]


Queries with Structure

Allow queries for text appearing in specific fields:
  “nuclear fusion” appearing in a chapter title

SFQL: Relational database query language SQL enhanced with “full text” search.

  Select abstract from journal.papers
  where author contains “Teller” and
        title contains “nuclear fusion” and
        date < 1/1/1950
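
The meaning of the query can be mimicked in plain Python over a list of records. This is only a sketch of the semantics, not of SFQL itself: the records are made up, "contains" is assumed to be a case-insensitive substring test, and the name `select_abstracts` is hypothetical.

```python
from datetime import date

# Made-up records standing in for the journal.papers table.
papers = [
    {"author": "E. Teller", "title": "Notes on nuclear fusion",
     "date": date(1948, 5, 1), "abstract": "abstract A"},
    {"author": "E. Teller", "title": "Notes on nuclear fusion, revisited",
     "date": date(1955, 3, 1), "abstract": "abstract B"},
    {"author": "A. Author", "title": "Nuclear fusion in stars",
     "date": date(1939, 1, 1), "abstract": "abstract C"},
]

def select_abstracts(records):
    """Abstracts of papers by Teller about nuclear fusion dated before 1 January 1950."""
    return [r["abstract"] for r in records
            if "teller" in r["author"].lower()           # author contains "Teller"
            and "nuclear fusion" in r["title"].lower()   # title contains "nuclear fusion"
            and r["date"] < date(1950, 1, 1)]            # date < 1/1/1950

selected = select_abstracts(papers)
```

Only the first record satisfies all three conditions: the second is too late and the third has a different author.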