Information Retrieval Techniques MS(CS) Lecture 2 AIR UNIVERSITY MULTAN CAMPUS

Apr 02, 2015


Elliot Reuben
Transcript
Page 1: Information Retrieval Techniques MS(CS) Lecture 2 AIR UNIVERSITY MULTAN CAMPUS.

Information Retrieval Techniques

MS(CS) Lecture 2 AIR UNIVERSITY MULTAN CAMPUS

Page 2

Issues and Challenges in IR

RELEVANCE ?

Page 3

Issues and Challenges in IR

• Query Formulation
  – Describing the information need
• Relevance
  – Relevant to the query (system relevance)
  – Relevant to the information need (user relevance)
• Evaluation
  – System-oriented (bypasses the user)
  – User-oriented (relevance feedback)

Page 4

What makes IR “experimental”?

• Evaluation
  – How do we design experiments that answer our questions?
  – How do we assess the quality of the documents that come out of the IR black box?
  – Can we do this automatically?

Page 5

Simplification? The information-seeking cycle:

  Source Selection (Resource)
        ↓
  Query Formulation (Query)
        ↓
  Search (Ranked List)
        ↓
  Selection (Documents)
        ↓
  Examination (Documents)
        ↓
  Delivery

Feedback loops: query reformulation, vocabulary learning, and relevance feedback (back to Query Formulation); source reselection (back to Source Selection).

Is this itself a vast simplification?

Page 6

The Central Problem in IR

  Information Seeker        Authors
        ↓                      ↓
     Concepts               Concepts
        ↓                      ↓
   Query Terms          Document Terms

Do these represent the same concepts?

Page 7

Problems in Query Formulation: Stefano Mizzaro's Model of Relevance in IR

• RIN: Real Information Need (the target)
• PIN: Perceived Information Need (in the mind)
• EIN: Expressed Information Need (in natural language)
• FIN: Formal Information Need (the query)

Paper reference: the four dimensions of relevance, by Stephen W. Draper.

Page 8

Taylor’s Model

• The visceral need (Q1): the actual, but unexpressed, need for information
• The conscious need (Q2): the conscious, within-brain description of the need
• The formalized need (Q3): the formal statement of the question
• The compromised need (Q4): the question as presented to the information system

Robert S. Taylor. (1962) The Process of Asking Questions. American Documentation, 13(4), 391--396.

Page 9

Taylor’s Model and IR Systems

  Visceral need (Q1)
        ↓
  Conscious need (Q2)
        ↓
  Formalized need (Q3)
        ↓
  Compromised need (Q4)
        ↓
  IR System → Results

Naïve users present their Question at the compromised level; negotiation works it back toward the real need.

Page 10

The classic search model

  User task:   Get rid of mice in a politically correct way
        ↓   (misconception?)
  Info need:   Info about removing mice without killing them
        ↓   (misformulation?)
  Query:       how trap mice alive
        ↓
  Search engine → Results (from the Collection)
        ↑   query refinement loops back to the query

Page 11

Building Blocks of IRS-I

• Different models of information retrieval
  – Boolean model
  – Vector space model
  – Language models
• Representing the meaning of documents
  – How do we capture the meaning of documents?
  – Is meaning just the sum of all terms?
• Indexing
  – How do we actually store all those words?
  – How do we access indexed terms quickly?

Page 12

Building Blocks of IRS-II

• Relevance Feedback
  – How do humans (and machines) modify queries based on retrieved results?
• User Interaction
  – Information retrieval meets human-computer interaction
  – How do we present search results to users in an effective manner?
  – What tools can systems provide to aid the user in information seeking?

Page 13

IR Extensions

• Filtering and Categorization
  – Traditional information retrieval: static collection, dynamic queries
  – What about static queries against dynamic collections?
• Multimedia Retrieval
  – Thus far, we’ve been focused on text…
  – What about images, sounds, video, etc.?
• Question Answering
  – We want answers, not just documents!

Page 14

Can you guess what kind of data IR mainly focuses on?

• Structured
• Unstructured
• Semi-structured

Page 15

What about databases?

• What are examples of databases?
  – Banks storing account information
  – Retailers storing inventories
  – Universities storing student grades
• What exactly is a (relational) database?
  – Think of them as a collection of tables
  – They model some aspect of “the world”

Page 16

A (Simple) Database Example

Department Table
  Department ID | Department
  EE            | Electrical Engineering
  HIST          | History
  CLIS          | Information Studies

Course Table
  Course ID | Course Name
  lbsc690   | Information Technology
  ee750     | Communication
  hist405   | American History

Enrollment Table
  Student ID | Course ID | Grade
  1          | lbsc690   | 90
  1          | ee750     | 95
  2          | lbsc690   | 95
  2          | hist405   | 80
  3          | hist405   | 90
  4          | lbsc690   | 98

Student Table
  Student ID | Last Name | First Name | Department ID | Email
  1          | Arrows    | John       | EE            | jarrows@wam
  2          | Peters    | Kathy      | HIST          | kpeters2@wam
  3          | Smith     | Chris      | HIST          | smith2002@glue
  4          | Smith     | John       | CLIS          | js03@wam

Page 17

IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in “tables”

  Employee | Manager | Salary
  Smith    | Jones   | 50000
  Chang    | Smith   | 60000
  Ivy      | Smith   | 50000

Structured data typically allows numerical-range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
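The example query can be run directly over the table rows. A minimal sketch, with the employee table modeled as a list of dicts:

```python
# Exact-match and range queries over structured rows (the employee
# table above), modeled here as a list of dicts.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

# Salary < 60000 AND Manager = Smith
hits = [e["employee"] for e in employees
        if e["salary"] < 60000 and e["manager"] == "Smith"]
# Only Ivy satisfies both conditions.
```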

Page 18

Database Queries

• What would you want to know from a database?
  – What classes is John Arrows enrolled in?
  – Who has the highest grade in LBSC 690?
  – Who’s in the history department?
  – Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters who were born on a Monday, who has the longest email address?
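The first two questions can be answered with SQL. A minimal sketch using Python's built-in sqlite3 module, with a hypothetical schema holding the sample rows from the earlier slide:

```python
import sqlite3

# In-memory database mirroring the slide's tables (hypothetical schema).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE student (id INTEGER, last TEXT, first TEXT, dept TEXT, email TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
CREATE TABLE course (id TEXT, name TEXT);
INSERT INTO student VALUES
  (1,'Arrows','John','EE','jarrows@wam'),
  (2,'Peters','Kathy','HIST','kpeters2@wam'),
  (3,'Smith','Chris','HIST','smith2002@glue'),
  (4,'Smith','John','CLIS','js03@wam');
INSERT INTO course VALUES
  ('lbsc690','Information Technology'),
  ('ee750','Communication'),
  ('hist405','American History');
INSERT INTO enrollment VALUES
  (1,'lbsc690',90),(1,'ee750',95),(2,'lbsc690',95),
  (2,'hist405',80),(3,'hist405',90),(4,'lbsc690',98);
""")

# "What classes is John Arrows enrolled in?"
classes = cur.execute("""
    SELECT c.name
    FROM student s
    JOIN enrollment e ON e.student_id = s.id
    JOIN course c ON c.id = e.course_id
    WHERE s.first = 'John' AND s.last = 'Arrows'
""").fetchall()

# "Who has the highest grade in lbsc690?"
top = cur.execute("""
    SELECT s.last
    FROM enrollment e
    JOIN student s ON s.id = e.student_id
    WHERE e.course_id = 'lbsc690'
    ORDER BY e.grade DESC LIMIT 1
""").fetchone()
```

Note how the query semantics are exact: the answer follows formally from the tables, with no notion of "relevance".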

Page 19

Unstructured data

• Typically refers to free text
• Allows
  – Keyword queries, including operators
  – More sophisticated “concept” queries, e.g.,
    • find all web pages dealing with drug abuse
• This is the classic model for searching text documents


Page 22

Semi-structured data

• In fact almost no data is “unstructured”• E.g., this slide has distinctly identified zones such

as the Title and Bullets• … to say nothing of linguistic structure

• Facilitates “semi-structured” search such as– Title contains data AND Bullets contain search

• Or even– Title is about Object Oriented Programming AND

Author something like stro*rup – where * is the wild-card operator
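The * wild-card can be illustrated with Python's fnmatch module, which implements shell-style pattern matching (the author strings below are made up, except for "stroustrup"):

```python
import fnmatch

# Match author names against the slide's pattern "stro*rup",
# where * stands for any sequence of characters.
authors = ["stroustrup", "strongrup", "stallman", "stroutrup"]
matches = [a for a in authors if fnmatch.fnmatch(a, "stro*rup")]
# "stallman" fails: it neither starts with "stro" nor ends with "rup".
```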


Page 23

Hopkins IR Workshop 2005 Copyright © Victor Lavrenko

Comparing IR to databases

                  Databases                            IR
  Data            Structured                           Unstructured
  Fields          Clear semantics (SSN, age)           No fields (other than text)
  Queries         Defined (relational algebra, SQL)    Free text (“natural language”), Boolean
  Recoverability  Critical (concurrency control,       Downplayed, though still an issue
                  recovery, atomic operations)
  Matching        Exact (results are always            Imprecise (need to measure
                  “correct”)                           effectiveness)

Page 24

Databases vs. IR

• What we’re retrieving
  – Databases: structured data, with clear semantics based on a formal model.
  – IR: mostly unstructured free text with some metadata.
• Queries we’re posing
  – Databases: formally (mathematically) defined, unambiguous.
  – IR: vague, imprecise information needs (often expressed in natural language).
• Results we get
  – Databases: exact, always correct in a formal sense.
  – IR: sometimes relevant, often not.
• Interaction with system
  – Databases: one-shot queries.
  – IR: interaction is important.
• Other issues
  – Databases: concurrency, recovery, atomicity are all critical.
  – IR: such issues are downplayed.

Page 25

IRS IN ACTION (TASKS)

Information Retrieval and Web Search, by Pandu Nayak and Prabhakar Raghavan

Page 26

Outline

• What is the IR problem?
• How to organize an IR system? (Or: the main processes in IR)
• Indexing
• Retrieval

Page 27

The problem of IR

• Goal = find documents relevant to an information need from a large document set

  Info need → Query → IR system (retrieval over the document collection) → Answer list

Page 28

IR problem

• First applications: in libraries (1950s)

  ISBN:    0-201-12227-8
  Author:  Salton, Gerard
  Title:   Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Editor:  Addison-Wesley
  Date:    1989
  Content: <Text>

• External attributes and an internal attribute (the content)
• Search by external attributes = search in a DB
• IR: search by content

Page 29

Possible approaches

1. String matching (linear search through the documents)
   - Slow
   - Difficult to improve
2. Indexing (*)
   - Fast
   - Flexible for further improvement
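The contrast can be sketched on a toy collection: string matching scans every document at query time, while an index maps each term to its documents once, up front.

```python
# Toy collection: doc_id -> text.
docs = {1: "information retrieval", 2: "database systems", 3: "retrieval models"}

# 1. String matching: scan all text for every query (slow at scale).
linear_hits = [d for d, text in docs.items() if "retrieval" in text]

# 2. Indexing: build once, then each lookup is a dictionary access.
index = {}
for d, text in docs.items():
    for term in text.split():
        index.setdefault(term, []).append(d)
indexed_hits = index.get("retrieval", [])

# Both approaches find documents 1 and 3 for the term "retrieval".
```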

Page 30

Indexing-based IR

  Document                        Query
     ↓ indexing                      ↓ indexing (query analysis)
  Representation (keywords)       Query representation (keywords)
             \                      /
              → query evaluation ←

Page 31

Main problems in IR

• Document and query indexing
  – How to best represent their contents?
• Query evaluation (or retrieval process)
  – To what extent does a document correspond to a query?
• System evaluation
  – How good is a system?
  – Are the retrieved documents relevant? (precision)
  – Are all the relevant documents retrieved? (recall)
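Precision and recall follow directly from set overlap. A toy example with hypothetical document IDs, assuming the full relevant set is known:

```python
# One query's results versus the (known) relevant set.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

# Precision: what fraction of retrieved documents are relevant?
precision = len(retrieved & relevant) / len(retrieved)  # 2/4 = 0.5

# Recall: what fraction of relevant documents were retrieved?
recall = len(retrieved & relevant) / len(relevant)      # 2/3
```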

Page 32

The basic indexing pipeline

  Documents to be indexed:   Friends, Romans, countrymen.
        ↓ Tokenizer
  Token stream:              Friends  Romans  Countrymen
        ↓ Linguistic modules
  Modified tokens:           friend  roman  countryman
        ↓ Indexer
  Inverted index:
    friend     → 2, 4
    roman      → 1, 2
    countryman → 13, 16
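The pipeline above can be sketched in a few lines of Python. The tokenizer and the plural-stripping "linguistic module" here are toy stand-ins for the real components:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude tokenizer: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def normalize(token):
    # Stand-in for the "linguistic modules" box: a toy plural strip,
    # not a real stemmer.
    return token[:-1] if token.endswith("s") else token

def build_index(docs):
    # docs maps doc_id -> text; returns term -> sorted postings list.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Friends, Romans, countrymen.", 2: "Roman friends"}
index = build_index(docs)
# index["friend"] and index["roman"] each list both documents.
```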

Page 33


Document indexing

• Goal = find the important meanings and create an internal representation
• Factors to consider:
  – Accuracy in representing meanings (semantics)
  – Exhaustiveness (covering all the contents)
  – Ease of manipulation by computer
• What is the best representation of contents?
  – Character string (character trigrams): not precise enough
  – Word: good coverage, not precise
  – Phrase: poorer coverage, more precise
  – Concept: poor coverage, precise

Moving from strings to words to phrases to concepts, coverage (recall) decreases while accuracy (precision) increases.

Page 34

Parsing a document

• What format is it in?
  – pdf / word / excel / html?
• What language is it in?
• What character set is in use?

Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically …

Sec. 2.1


Page 35

Complications: Format/language

• Documents being indexed can include docs from many different languages
  – A single index may have to contain terms of several languages.
• Sometimes a document or its components can contain multiple languages/formats
  – French email with a German pdf attachment.
• What is a unit document?
  – A file?
  – An email? (Perhaps one of many in an mbox.)
  – An email with 5 attachments?
  – A group of files (PPT or LaTeX as HTML pages)

Sec. 2.1


Page 36

HOW TO CONSTRUCT AN INDEX OF TERMS?

Page 37

Stopwords / Stoplist

• Function words do not bear useful information for IR:
  of, in, about, with, I, although, …
• Stoplist: contains stopwords, which are not used as index terms
  – Prepositions
  – Articles
  – Pronouns
  – Some adverbs and adjectives
  – Some frequent words (e.g. “document”)
• The removal of stopwords usually improves IR effectiveness
• A few “standard” stoplists are commonly used.

Page 38

Stop words

• With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  – They have little semantic content: the, a, and, to, be
  – There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
  – Good compression techniques mean the space for including stop words in a system is very small
  – Good query optimization techniques mean you pay little at query time for including stop words
  – You need them for:
    • Phrase queries: “King of Denmark”
    • Various song titles, etc.: “Let it be”, “To be or not to be”
    • “Relational” queries: “flights to London”
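Stopword removal itself is a simple filter. A sketch with a toy stoplist (a real system would use one of the "standard" lists, or none at all, per the trend above):

```python
# Toy stoplist built from the words mentioned on these slides.
STOPLIST = {"of", "in", "about", "with", "i", "although",
            "the", "a", "and", "to", "be"}

def remove_stopwords(tokens):
    # Drop any token whose lowercase form is on the stoplist.
    return [t for t in tokens if t.lower() not in STOPLIST]

kept = remove_stopwords(
    ["information", "about", "the", "retrieval", "of", "documents"])
# Only the content-bearing words survive.
```

Note the phrase-query caveat: this filter would reduce "to be or not to be" to almost nothing.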

Sec. 2.2.2


Page 39

Stemming

• Reason:
  – Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
  – Removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput

Page 40

Stemming

• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat.

For example, “compressed” and “compression” are both accepted as equivalent to “compress”:

  Original:  for example compressed and compression are both accepted as equivalent to compress.
  Stemmed:   for exampl compress and compress ar both accept as equival to compress
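A deliberately crude suffix-chopping stemmer in the spirit of these slides (this is not the Porter algorithm; real stemmers apply many more rules and conditions):

```python
# Suffixes to strip, longest first so "ation" wins over "s".
SUFFIXES = ["ation", "ing", "ed", "es", "er", "e", "s"]

def crude_stem(word):
    # Chop the first matching suffix, but keep at least 4 characters
    # of stem so short words are left alone.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[:-len(suf)]
    return word

forms = ["computer", "compute", "computes",
         "computing", "computed", "computation"]
stems = [crude_stem(w) for w in forms]
# All six forms collapse to the single stem "comput".
```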

Sec. 2.2.4
