
Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages Alexander Gelbukh .

Mar 27, 2015

Page 1: Special Topics in Computer Science

The Art of Information Retrieval

Chapter 4: Query Languages

Alexander Gelbukh

www.Gelbukh.com

Page 2: Previous Chapter

Main measures: Precision & Recall.
o For sets
o Rankings are evaluated through initial subsets

There are measures that combine them into one
o Involve user-defined preferences. In the F-measure, set to 50-50

Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful

Reference collections exist with known answers to evaluate new algorithms
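
As a reminder of how the set-based measures are computed, here is a minimal sketch. The function name and the `beta` parameter are illustrative choices, not taken from the slides; `beta = 1` corresponds to the 50-50 weighting mentioned above.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall and F-measure.

    beta = 1 weights precision and recall equally (the "50-50" case).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# 4 documents retrieved, 3 relevant in total, 2 of them found.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 4, 6})
```

For a ranking, the same function can be applied to each initial subset (top 1, top 2, ...) of the ranked list.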

Page 3: Previous chapter: research issues

Different types of interfaces; interactive systems:
o What measures to use?
o How do people judge relevance?
o How can “user satisfaction” be measured? Modeled?

Page 4: Query languages

Query language = type of possible queries
Types of queries depend on the IR model
Types:
o IR (= ranked output)
o Data retrieval
o User-oriented
o Low-level (= protocols)

Assume all pre-processing has been done
o Thesaurus, stop-words, ...
o (I think this must be a part of the language!)

Returns “documents” (chapter, paragraph, ...)

Page 5: In this chapter

Keyword-based languages
Pattern matching
Structure taken into account
Protocols

Page 6: Keyword-based languages: Single word

Intuitive, easy to express, fast ranking.
o Words can be highlighted in the output.

What is a word?
o Letters, separators
o Non-splitting characters: on-line.
o Database decides.

TF-IDF is designed for words
Used for the main models (Boolean, Vector, Probabilistic)
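
The "what is a word" decision can be sketched as a small tokenizer. This is only an illustration, assuming ASCII letters and digits as word characters; the `non_splitting` parameter is a hypothetical name for the set of characters the database chooses not to split on.

```python
import re


def tokenize(text, non_splitting="-"):
    """Split text into lowercase words.

    Characters in `non_splitting` join the parts on either side
    (so "on-line" stays one word); everything else is a separator.
    """
    word = r"[A-Za-z0-9]+"
    if non_splitting:
        joiner = "[" + re.escape(non_splitting) + "]"
        pattern = word + "(?:" + joiner + word + ")*"
    else:
        pattern = word
    return [w.lower() for w in re.findall(pattern, text)]


print(tokenize("An on-line database."))
# → ['an', 'on-line', 'database']
print(tokenize("An on-line database.", non_splitting=""))
# → ['an', 'on', 'line', 'database']
```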

Page 7: Keyword-based languages: Context Queries

Ensure that the words are related

Phrase
o “enhance retrieval”
o Allows separators and stopwords: “enhance the retrieval”

Proximity
o “enhance the quality of information retrieval”
o Distance: words, letters. Order: same or not

Not clear how to rank
o Research issue
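
A proximity query can be sketched as follows, assuming the text is already tokenized and distance is counted in words. The function and parameter names are illustrative; a real system would read positions from a positional index instead of scanning the tokens.

```python
def within_distance(tokens, first, second, max_gap, ordered=True):
    """True if `first` and `second` occur at most `max_gap` words apart.

    With ordered=True, `first` must come before `second`.
    """
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    for i in pos_first:
        for j in pos_second:
            gap = j - i if ordered else abs(j - i)
            if 0 < gap <= max_gap:
                return True
    return False


tokens = "enhance the quality of information retrieval".split()
print(within_distance(tokens, "enhance", "retrieval", max_gap=5))   # → True
print(within_distance(tokens, "enhance", "retrieval", max_gap=4))   # → False
print(within_distance(tokens, "retrieval", "enhance", 5, ordered=False))  # → True
```

This only answers yes or no; how to turn the distance into a ranking score is the open question noted above.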

Page 8: Keyword-based languages: Boolean Queries

Boolean expressions (can combine basic queries)

Query syntax tree
o translation AND (syntax OR syntactic)

Operations on the sets
o Result: set

OR, AND, e1 BUT e2
o NOT is not used: it could return (almost) all docs (= unsafe)

Good: can highlight occurrences, sort
Bad: difficult for the users
Remedy (?): fuzzy Boolean (see below).

Basic queries = keyword, pattern
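
Evaluating a query syntax tree as set operations might look like the sketch below. The nested-tuple tree format and the toy `index` (word to set of document ids) are assumptions made for the example.

```python
def evaluate(node, index):
    """Evaluate a Boolean query tree against an inverted index.

    A leaf is a keyword (str); an inner node is (operator, left, right).
    The result is always a set of document ids.
    """
    if isinstance(node, str):
        return set(index.get(node, ()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":  # e1 BUT e2: documents matching e1 and not e2
        return a - b
    raise ValueError(f"unknown operator: {op}")


# Toy inverted index: word -> ids of documents containing it.
index = {"translation": {1, 2, 4}, "syntax": {2, 3}, "syntactic": {4, 5}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index))  # → {2, 4}
```

BUT is safe because it only removes documents from an already computed set, whereas a bare NOT would need the complement over the whole collection.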

Page 9: Keyword-based languages: Fuzzy Boolean, Natural Language

Fuzzy Boolean: OR, AND = some.
o AND punishes for absence, OR encourages multiple.
o Natural ranking: how many times?

Natural Language: OR = AND
o BUT can be expressed (= penalty)
o How to rank? Different ways

Vector space model
o Query is a vector
o A doc can be taken as a vector. Relevance feedback!

Proximity is ignored
o (Why? Research issue.)
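
One of the "different ways" to rank a natural-language query is cosine similarity in the vector space model. The sketch below uses raw term counts as weights, which is a simplification chosen for brevity (TF-IDF weights would normally be used).

```python
import math
from collections import Counter


def cosine(query_tokens, doc_tokens):
    """Cosine similarity between two bags of words (raw term counts)."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


query = ["information", "retrieval"]
doc = ["information", "retrieval", "retrieval", "systems"]
score = cosine(query, doc)  # about 0.866
```

Because both arguments are just token lists, a whole document can be passed as the query, which is the idea behind relevance feedback. Word order and proximity play no role in the score.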

Page 10: Pattern matching...

Pattern = sequence of features
o Text segment matches the pattern

Types:

Words

Prefixes, suffixes, substrings:
o comput-, -ters, -any flow- (many flowers).

Ranges
o implies some order, e.g., lexicographical = alphabetic

Allowing errors
o Levenshtein (= edit) distance: historical / hysterical
o # insertions, deletions, replacements. Threshold.
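
The edit distance used for matching with errors can be computed with the standard dynamic-programming recurrence; a minimal sketch follows. The helper `matches_with_errors` and its `threshold` parameter are illustrative names.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def matches_with_errors(word, pattern, threshold):
    """True if word is within `threshold` edits of pattern."""
    return levenshtein(word, pattern) <= threshold


print(levenshtein("historical", "hysterical"))  # → 2
```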

Page 11: ...Pattern matching

...Types

Regular expressions
o union = or: if e1, e2 are expressions, (e1 | e2) is too
o concatenation: e1 e2
o repetition: e* (0 or more occurrences)

Extended patterns
o user-friendly; can be internally converted into simple ones
o case-insensitive, “anything” (wildcard), digit, vowel, ...
o conditionals, optional
o some parts match exactly and others with errors,
o etc.
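
The three basic operations and a few extended features can be tried with Python's standard `re` module. The concrete patterns below are made-up examples, not taken from the slides.

```python
import re

# Basic operations only: union (|), concatenation, repetition (*).
basic = re.compile(r"pro(blem|tein)s*")

# Extended features: case-insensitive flag, optional part (?),
# digit class (\d), vowel class ([aeiou]), wildcard (.).
extended = re.compile(r"colou?r\d", re.IGNORECASE)
vowel_then_any = re.compile(r"[aeiou].")

print(bool(basic.fullmatch("problems")))       # → True
print(bool(basic.fullmatch("pro")))            # → False
print(bool(extended.fullmatch("COLOUR2")))     # → True
print(bool(vowel_then_any.fullmatch("ax")))    # → True
```

Each extended feature is shorthand that could be expanded into the basic operations; for example, `u?` is the union of `u` and the empty string.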

Page 12: Structural queries

Old days: fields. No nesting, no overlap, fixed order.
o Email: subject, body, sender, ...
o = Relational database with text type, treated as text should be
o Versions of SQL with text operators

Hypertext
o Not well developed. Too free
o WebGlimpse: search the neighborhood

Hierarchical
o Intermediate level of freedom
o Volumes, chapters, sections, paragraphs, sentences, ...

Page 13: [Figure labels: Too fixed / Too free / Intermediate]

Page 14: Hierarchical Models ...

PAT expressions
o Hierarchy is defined at query time.
o Regions are included in the index, e.g., sections, italics, ...
o Different types of regions can overlap, same type can’t
o Can query for words in a region, regions in a region, etc.
o Complex computation, unclear semantics

Overlapped lists
o Evolution of PAT: areas of same type can overlap (not nest)
o Uses same inverted file
o Can combine regions, specify order, ...
o n-words: all (overlapping) areas of n words.
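
Region queries of this kind can be pictured with regions stored as (start, end) word positions. The sketch below is a naive illustration with made-up function names; it does not reproduce the actual PAT or overlapped-lists algorithms, which work on sorted lists from the index.

```python
def regions_containing(word_positions, regions):
    """Regions (start, end), inclusive, that contain at least one
    occurrence of the word."""
    return [(s, e) for s, e in regions
            if any(s <= p <= e for p in word_positions)]


def n_words(text_length, n):
    """All overlapping areas of n consecutive words in a text of
    `text_length` words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]


sections = [(0, 4), (5, 9), (10, 14)]
print(regions_containing([3, 12], sections))  # → [(0, 4), (10, 14)]
print(n_words(5, 3))                          # → [(0, 2), (1, 3), (2, 4)]
```

Since `n_words` returns ordinary regions, "two words within a window of n words" can be expressed as a region query over its result.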

Page 15: Overlapping lists

Page 16: ... Hierarchical Models ...

List of references
o Answers are references (pointers) to regions
o Only one type of regions (e.g., only sections). No nesting.
o Known at index time
o Ancestry of nodes. Can query paths

Proximal nodes
o Compromise between expressiveness and efficiency
o Many (overlapping) fixed hierarchies
o Interesting queries: “3rd paragraph of each chapter”, ...
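
A query like “3rd paragraph of each chapter” can be illustrated on a small document tree. This is a plain tree traversal under an assumed `Node` structure, meant only to show what the query asks for; it is not the proximal nodes evaluation algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                  # e.g. "book", "chapter", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node, kind):
    """Yield descendants of `node` of the given kind, in document order."""
    for child in node.children:
        if child.kind == kind:
            yield child
        yield from descendants(child, kind)


def nth_in_each(root, outer_kind, inner_kind, n):
    """The n-th (1-based) `inner_kind` node inside each `outer_kind` node."""
    result = []
    for outer in descendants(root, outer_kind):
        inner = list(descendants(outer, inner_kind))
        if len(inner) >= n:
            result.append(inner[n - 1])
    return result


book = Node("book", children=[
    Node("chapter", children=[
        Node("section", children=[Node("paragraph", "c1p1"),
                                  Node("paragraph", "c1p2")]),
        Node("paragraph", "c1p3"),
    ]),
    Node("chapter", children=[Node("paragraph", "c2p1"),
                              Node("paragraph", "c2p2")]),
])
third = [p.text for p in nth_in_each(book, "chapter", "paragraph", 3)]
# third == ["c1p3"]; the second chapter has only two paragraphs.
```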

Page 17: Proximal nodes

Page 18: ... Hierarchical Models

Tree matching
o Query is a tree. Match the text tree.
o Ordered or unordered trees (are siblings ordered?)
o Prolog-like constraints on different parts of the tree: variables
o Answer: root of a match
o Very inefficient (usually NP-hard), due to variables and unordered matching
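
To make the idea concrete, here is a sketch of a deliberately restricted variant: ordered matching, no variables, and pattern children matched against direct children only. Under these assumptions a greedy left-to-right scan suffices; the general problem described above is much harder. The `(label, children)` tuple format is an assumption for the example.

```python
def matches_at(pattern, node):
    """True if `pattern` matches with its root at `node`.

    Both trees are (label, [children]). Pattern children must match
    distinct children of `node` in the same left-to-right order;
    extra children in the text tree are allowed.
    """
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    i = 0
    for p_child in p_children:
        while i < len(n_children) and not matches_at(p_child, n_children[i]):
            i += 1
        if i == len(n_children):
            return False
        i += 1
    return True


def match_roots(pattern, node, path=()):
    """Paths (child indices from the root) of all nodes where the
    pattern matches; the answer to the query is the root of each match."""
    found = [path] if matches_at(pattern, node) else []
    for k, child in enumerate(node[1]):
        found.extend(match_roots(pattern, child, path + (k,)))
    return found


text = ("book", [
    ("chapter", [("title", []),
                 ("section", [("title", []), ("para", [])])]),
    ("chapter", [("para", [])]),
])
pattern = ("chapter", [("title", []), ("section", [])])
print(match_roots(pattern, text))  # → [(0,)]
```

Swapping the two pattern children gives no match here, which is exactly the difference between ordered and unordered matching.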

Page 19: Research issues in hierarchical models

Static or dynamic?
o Define the hierarchy at index time or at query time?
o Static: text markup. Dynamic: tags, indexed.

Restrictions on the structure
o Restrict the structure or restrict the query language
o For efficiency

Integration with text
o What is of secondary importance: structure (in IR) or text (in DB)?
o Combine

Query language
o Standardization, expressiveness taxonomy, categorization

Page 20: Query protocols

Used internally
Standard: one client can query different libraries
o In CD-ROMs, disk interchangeability

Z39.50: bibliographic (used for other types, too)
WAIS (Wide Area Information Service)
o Includes Z39.50

For CD-ROMs:
o CCL, Common Command Language
o CD-RDx (Compact Disk Read only Data Exchange)
o SFQL (Structured Full-text Query Language). Like DB.

Page 21: Types of queries we have discussed

Page 22: Trends and research topics

Models: to better understand the user needs
Query languages: flexibility, power, expressiveness, functionality
Visual languages
o Example: library shown on the screen. Act: take books, open catalogs, etc.
o Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Page 23: Conclusions

Width-wide:
o words, phrases, proximity, fuzzy Boolean, natural language

Depth-wide:
o Pattern matching

If queries return sets, the results can be combined using the Boolean model
Combining with structure
o Hierarchical structure

Standardized low-level languages: protocols
o Reusable

Page 24: Thank you!

Till October 16
October 23: midterm exam