INFORMATION RETRIEVAL: A Look into the Science of Web Search Engines
Muhammad Atif Qureshi
Reach me on Twitter: @matifq Email [email protected]

Information Retrieval

May 13, 2015

This talk covers the basics of the science of Information Retrieval, using a story mode to introduce information and its various aspects. It then takes you on a quick journey through the process of building a search engine.
Transcript
Page 1: Information Retrieval

INFORMATION RETRIEVAL
A Look into the Science of Web Search Engines

Muhammad Atif Qureshi

Page 2: Information Retrieval

Contents

Story Mode Learning
Learning by Imagination
Appendix

Page 3: Information Retrieval

Story Mode Learning
(Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist at Twitter)

Page 4: Information Retrieval

Information Retrieval Systems

Information
  What is “information”?
Retrieval
  What do we mean by “retrieval”?
  What are the different types of information needs?
Systems
  How do computer systems fit into the human information seeking process?

Page 5: Information Retrieval

What is Information?

What do you think?
There is no “correct” definition
Cookie Monster’s definition: “news or facts about something”
Different approaches:
  Philosophy
  Psychology
  Linguistics
  Electrical engineering
  Physics
  Computer science
  Information science

Page 6: Information Retrieval

Dictionary says…

Oxford English Dictionary
  information: informing, telling; thing told, knowledge, items of knowledge, news
  knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
Random House Dictionary
  information: knowledge communicated or received concerning a particular fact or circumstance; news

Page 7: Information Retrieval

Intuitive Notions

Information must
  Be something, although the exact nature (substance, energy, or abstract concept) is not clear
  Be “new”: repetition of previously received messages is not informative
  Be “true”: false or counterfactual information is “mis-information”
  Be “about” something

Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.

Page 8: Information Retrieval

Three Views of Information

Information as process
Information as communication
Information as message transmission and reception

Page 9: Information Retrieval

One View

Information = characteristics of the output of a process
  Tells us something about the process and the input
Information-generating processes do not occur in isolation

[Diagram: several Inputs → Process → several Outputs]
[Diagram: Input → Process1 → Process2 → … → Output]

Page 10: Information Retrieval

Where’s the human?

If a tree falls in the forest, and no one is around to hear it, is information transmitted?
In the “information as process” view: Yes, but that’s not very interesting to us
We’re concerned about information for human consumption
  Transmission of information from one person to another
  Recording of information
  Reconstruction of stored information

Page 11: Information Retrieval

Another View

Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient”
  This implies that the sender has knowledge of the recipient's structure
Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient”
Information = “the structure of any text which is capable of changing the image-structure of a recipient”

Page 12: Information Retrieval

Transfer of Information

Communication = transmission of information

[Diagram: sender's Thoughts → Words → Sounds (Encoding), carried by Speech, Writing, or Telepathy(?), then recipient's Sounds → Words → Thoughts (Decoding)]

Page 13: Information Retrieval

Information Theory

Better called “communication theory”
Developed by Claude Shannon in the 1940s
Concerned with the transmission of electrical signals over wires
  How do we send information quickly and reliably?
Underlies modern electronic communication:
  Voice and data traffic…
  Over copper, fiber optic, wireless, etc.
Famous result: Channel Capacity Theorem
Formal measure of information in terms of entropy
  Information = “reduction in surprise”
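To make the entropy measure concrete, here is a minimal sketch of Shannon entropy and of the "surprise" of a single event, both in bits. The function names and the coin examples are illustrative assumptions, not part of the original slides.

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over all outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical sources: the less predictable the source, the higher its entropy.
fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
certain_outcome = entropy([1.0])
```

A fair coin carries one full bit per toss, a heavily biased coin carries less, and an outcome that is certain carries none, which is one way to read "reduction in surprise".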

Page 14: Information Retrieval

The Noisy Channel Model

Communication = producing the same message at the destination that was sent at the source
  The message must be encoded for transmission across a medium (called channel)
  But the channel is noisy and can distort the message
Semantics (meaning) is irrelevant

[Diagram: Source → message → Transmitter → channel (+ noise) → Receiver → message → Destination]
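The model above can be sketched with a toy bit channel. This is only an illustration under assumptions not in the slides: the message is a list of bits, the channel flips each bit with some probability, and the encoder uses a simple repetition code with majority-vote decoding.

```python
import random

def encode(bits, repeat=3):
    """Transmitter: repeat every bit so the receiver can outvote the noise."""
    return [bit for bit in bits for _ in range(repeat)]

def noisy_channel(bits, flip_prob, rng):
    """Channel: flip each bit independently with probability flip_prob."""
    return [bit ^ 1 if rng.random() < flip_prob else bit for bit in bits]

def decode(bits, repeat=3):
    """Receiver: majority vote over each group of repeated bits."""
    return [
        int(sum(bits[i:i + repeat]) > repeat // 2)
        for i in range(0, len(bits), repeat)
    ]

rng = random.Random(0)                 # seeded so runs are repeatable
message = [1, 0, 1, 1, 0]
received = decode(noisy_channel(encode(message), flip_prob=0.05, rng=rng))
```

Note that nothing in the code looks at what the bits mean, which matches the slide's point that semantics is irrelevant to the model.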

Page 15: Information Retrieval

A Synthesis

Information retrieval as communication over time and space, across a noisy channel

[Diagram: Source → message → Transmitter → channel (+ noise) → Receiver → message → Destination]
[Diagram: Sender → message → Encoding (indexing/writing) → storage (+ noise) → Decoding (retrieval/reading) → message → Recipient]

Page 16: Information Retrieval

“Retrieval?”

“Fetch something” that’s been stored
Recover a stored state of knowledge
Search through stored messages to find some messages relevant to the task at hand

[Diagram: Sender → message → Encoding (indexing/writing) → storage (+ noise) → Decoding (retrieval/reading) → message → Recipient]

Page 17: Information Retrieval

What is IR?

Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user

Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.

Page 18: Information Retrieval

Types of Information Needs

Retrospective
  “Searching the past”
  Different queries posed against a static collection
  Time invariant
Prospective
  “Searching the future”
  Static query posed against a dynamic collection
  Time dependent

Page 19: Information Retrieval

Retrospective Searches (I)

Ad hoc retrieval: find documents “about this”
  Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
  Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.
Known item search
  Find Jimmy Lin’s homepage.
  What’s the ISBN number of “Modern Information Retrieval”?
Directed exploration
  Who makes the best chocolates?
  What video conferencing systems exist for digital reference desk services?

Page 20: Information Retrieval

Retrospective Searches (II)

Question answering
  “Factoid”
    Who discovered Oxygen?
    When did Hawaii become a state?
    Where is Ayer’s Rock located?
    What team won the World Series in 1992?
  “List”
    What countries export oil?
    Name U.S. cities that have a “Shubert” theater.
  “Definition”
    Who is Aaron Copland?
    What is a quasar?

Page 21: Information Retrieval

Prospective “Searches”

Filtering
  Make a binary decision about each incoming document
  Spam or not spam?
Routing
  Sort incoming documents into different bins
  Categorize news headlines: World? Nation? Metro? Sports?
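As a rough sketch of the two prospective tasks, the code below applies a fixed rule to each incoming document. The cue phrases, keyword sets, and the "Unsorted" fallback bin are hypothetical choices made for illustration; real filters and routers are usually learned from data.

```python
# Hypothetical cue phrases for the spam filter.
SPAM_TERMS = ("lottery", "free money", "click here")

def is_spam(message):
    """Filtering: one yes/no decision per incoming document."""
    text = message.lower()
    return any(term in text for term in SPAM_TERMS)

# Hypothetical keyword sets, one per bin.
BINS = {
    "World": {"treaty", "embassy", "summit"},
    "Nation": {"congress", "senate", "federal"},
    "Metro": {"subway", "council", "downtown"},
    "Sports": {"match", "league", "score"},
}

def route(headline):
    """Routing: put each incoming document into the best-matching bin."""
    words = set(headline.lower().split())
    overlap = {name: len(words & keywords) for name, keywords in BINS.items()}
    best = max(overlap, key=overlap.get)
    return best if overlap[best] > 0 else "Unsorted"
```

In both cases the "query" (the rule) stays fixed while the collection keeps changing, which is what makes these searches prospective.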

Page 22: Information Retrieval

What types of information?

Text (Documents and portions thereof)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services

Page 23: Information Retrieval

Content-Based Search

This is a relatively new concept!
What else would you search on?
What’s more effective?
Why is this hard in many applications?

Page 24: Information Retrieval

Interesting Examples

Google image search: http://images.google.com/
Google video search: http://video.google.com/
Query by humming: http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html

Page 25: Information Retrieval

What about databases?

What are examples of databases?
  Banks storing account information
  Retailers storing inventories
  Universities storing student grades
What exactly is a (relational) database?
  Think of them as a collection of tables
  They model some aspect of “the world”

Page 26: Information Retrieval

A (Simple) Database Example

Student Table
Student ID  Last Name  First Name  Department ID  email
1           Arrows     John        EE             jarrows@wam
2           Peters     Kathy       HIST           kpeters2@wam
3           Smith      Chris       HIST           smith2002@glue
4           Smith      John        CLIS           js03@wam

Department Table
Department ID  Department
EE             Electrical Engineering
HIST           History
CLIS           Information Studies

Course Table
Course ID  Course Name
lbsc690    Information Technology
ee750      Communication
hist405    American History

Enrollment Table
Student ID  Course ID  Grade
1           lbsc690    90
1           ee750      95
2           lbsc690    95
2           hist405    80
3           hist405    90
4           lbsc690    98

Page 27: Information Retrieval

Database Queries

What would you want to know from a database?
  What classes is John Arrows enrolled in?
  Who has the highest grade in LBSC 690?
  Who’s in the history department?
  Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and who were born on a Monday, who has the longest email address?
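The first three questions can be answered with exact, formally defined queries. The sketch below loads the example tables into an in-memory SQLite database using Python's standard `sqlite3` module; the table and column names are assumptions chosen to mirror the slide, not names from the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, last_name TEXT,
                          first_name TEXT, department_id TEXT, email TEXT);
    CREATE TABLE course (course_id TEXT PRIMARY KEY, course_name TEXT);
    CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
""")
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?, ?)", [
    (1, "Arrows", "John", "EE", "jarrows@wam"),
    (2, "Peters", "Kathy", "HIST", "kpeters2@wam"),
    (3, "Smith", "Chris", "HIST", "smith2002@glue"),
    (4, "Smith", "John", "CLIS", "js03@wam"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("lbsc690", "Information Technology"),
    ("ee750", "Communication"),
    ("hist405", "American History"),
])
conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", [
    (1, "lbsc690", 90), (1, "ee750", 95), (2, "lbsc690", 95),
    (2, "hist405", 80), (3, "hist405", 90), (4, "lbsc690", 98),
])

# What classes is John Arrows enrolled in?
arrows_classes = [row[0] for row in conn.execute("""
    SELECT c.course_name
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    JOIN course c ON c.course_id = e.course_id
    WHERE s.first_name = 'John' AND s.last_name = 'Arrows'
    ORDER BY c.course_id
""")]

# Who has the highest grade in LBSC 690?
top_lbsc690 = conn.execute("""
    SELECT s.first_name, s.last_name, e.grade
    FROM enrollment e JOIN student s ON s.student_id = e.student_id
    WHERE e.course_id = 'lbsc690'
    ORDER BY e.grade DESC LIMIT 1
""").fetchone()

# Who's in the history department?
history_students = [row[0] for row in conn.execute(
    "SELECT first_name FROM student WHERE department_id = 'HIST' ORDER BY student_id"
)]
```

The last question on the slide cannot be answered from these tables at all, since no birth dates are stored; a database can only answer what its schema models.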

Page 28: Information Retrieval

Databases vs. IR

What we’re retrieving
  Databases: Structured data. Clear semantics based on a formal model.
  IR: Mostly unstructured. Free text with some metadata.
Queries we’re posing
  Databases: Formally (mathematically) defined queries. Unambiguous.
  IR: Vague, imprecise information needs (often expressed in natural language).
Results we get
  Databases: Exact. Always correct in a formal sense.
  IR: Sometimes relevant, often not.
Interaction with system
  Databases: One-shot queries.
  IR: Interaction is important.
Other issues
  Databases: Concurrency, recovery, atomicity are all critical.
  IR: Issues downplayed.

Page 29: Information Retrieval

The Big Picture

The four components of the information retrieval environment:
  User
  Process
  System
  Collection

[Diagram labels: “What computer geeks care about!” and “What we care about!”]

Page 30: Information Retrieval

The Information Retrieval Cycle

[Diagram: Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery. Feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection.]

Page 31: Information Retrieval

Supporting the Search Process

[Diagram: Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery. Supporting components: Acquisition → Collection; Indexing → Index.]

Page 32: Information Retrieval

Simplification?

[Diagram: Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery. Feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection.]

Is this itself a vast simplification?

Page 33: Information Retrieval

Tackling the IR Challenge

Divide and conquer!
Strategy: use encapsulation to limit complexity
Approach:
  Define interfaces (input and output) for each component
  Define the functions performed by each component
  Study each component in isolation
  Repeat the process within components as needed
  Make sure that this decomposition makes sense
Result: a hierarchical decomposition

Page 34: Information Retrieval

Where do we make the cut?

Study the IR black box in isolation
  Simple behavior: in goes a query, out come documents
  Optimize the quality of documents that come out
Study everything else around the black box
  Put the human back in the loop!

[Diagram: Query → Search → Ranked List]

Page 35: Information Retrieval

The IR Black Box

[Diagram: Query and Documents go into the black box; Hits come out]

Page 36: Information Retrieval

Inside The IR Black Box

[Diagram: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Query Representation and Index → Comparison Function → Hits]
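The components in the diagram can be sketched in a few lines. Everything here is an illustrative assumption: the representation function is a bag of lowercased words, the comparison function counts shared terms, and the index is a plain dictionary of document representations.

```python
import re
from collections import Counter

def represent(text):
    """Representation function: text -> bag of lowercased words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def compare(query_rep, doc_rep):
    """Comparison function: number of query term occurrences the document covers."""
    return sum(min(count, doc_rep[term]) for term, count in query_rep.items())

def search(query, index):
    """Black box: query in, ranked hits (document ids) out."""
    query_rep = represent(query)
    scored = [(compare(query_rep, rep), doc_id) for doc_id, rep in index.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

# Hypothetical three-document collection.
documents = {
    "d1": "Web search engines rank web pages",
    "d2": "Cooking pasta at home",
    "d3": "How search engines index the web",
}
index = {doc_id: represent(text) for doc_id, text in documents.items()}
hits = search("web search", index)
```

Because the query and the documents pass through the same representation function, the comparison only ever sees terms, never the concepts behind them.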

Page 37: Information Retrieval

The Central Problem in IR

[Diagram: Information Seeker: Concepts → Query Terms; Authors: Concepts → Document Terms]

Do these represent the same concepts?

Page 38: Information Retrieval

Learning by Imagination

Page 39: Information Retrieval

Imagine a System

We have 1000s of web pages; what makes these web pages different?
  Maybe different key terms or keywords occurring in different web pages (e.g., sports, education, video sharing)

Page 40: Information Retrieval

Realize Query Needs

What do we expect when we query?
  A query can be a single word (no order), a collection of words, i.e., a free sentence (order does not matter), or a strict phrase (order matters, e.g., "I love Pakistan")
How do we manage the data of web pages?
  A bag-of-words data structure with or without the positions of words/terms (simply, a posting list of words/terms)
What’s the best match?
  We have many matching results, but what’s the order?
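A posting list with positions can be sketched as a nested dictionary: term → page → list of word positions. The page texts and function names below are hypothetical; the sketch only shows why positions are needed for strict phrases but not for free-sentence queries.

```python
from collections import defaultdict

def build_index(pages):
    """Positional posting lists: term -> {page_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return index

def match_all(index, query):
    """Free sentence: pages containing every query word, in any order."""
    terms = query.lower().split()
    postings = [set(index[term]) if term in index else set() for term in terms]
    return set.intersection(*postings) if postings else set()

def match_phrase(index, query):
    """Strict phrase: the query words must appear consecutively, in order."""
    terms = query.lower().split()
    hits = set()
    for page_id in match_all(index, query):
        starts = index[terms[0]][page_id]
        if any(
            all(start + offset in index[term][page_id]
                for offset, term in enumerate(terms))
            for start in starts
        ):
            hits.add(page_id)
    return hits

# Hypothetical pages.
pages = {
    "p1": "I love Pakistan",
    "p2": "Pakistan I love cricket",
    "p3": "I love cricket",
}
index = build_index(pages)
```

Both `p1` and `p2` match the words "love Pakistan" in any order, but only `p1` matches the strict phrase "I love Pakistan". Neither function says which match is best; that is the ranking question.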

Page 41: Information Retrieval

Order of Matching Results

How could we rank web pages?
  Via the matching score of the query content against web pages, i.e., content-based methods
  Via the importance of web pages, i.e., link-based methods

Page 42: Information Retrieval

What does Content Tell?

Content information:
  Rare terms give more information than frequent terms, as common terms do not differentiate well between the content of documents (information entropy)
So what do common words make?
  Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., the word “science” inside scientific documents)
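One common way to turn "rare terms are more informative" into numbers is inverse document frequency (idf), combined with term frequency and cosine similarity. The sketch below is a minimal version under assumptions of my own: a three-document toy collection, whitespace tokenization, and idf = log(N / df).

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "the science of the web",
    "d2": "the science of cooking",
    "d3": "the history of the pyramids",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for words in tokens.values() for term in set(words))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(words):
    """Weight each term by its count times its idf (unknown terms get 0)."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(words).items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query):
    """Order documents by cosine similarity to the query, best first."""
    query_vec = tfidf(query.lower().split())
    scores = {doc_id: cosine(query_vec, tfidf(words)) for doc_id, words in tokens.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Here "the" and "of" occur in every document, so their idf is zero and they behave like stop words, while "pyramids" (one document) outweighs "science" (two documents).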

Page 43: Information Retrieval

Ranking Methods

Content-based methods:
  Examples: tf-idf with cosine similarity, BM25, etc.
Link-based methods:
  Examples: PageRank, HITS, etc.
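For the link-based side, here is a minimal power-iteration sketch of PageRank. The link graph, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions; production implementations differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to.

    Every link target is assumed to also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share from the random jump.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
```

In this graph "home" collects the most incoming links and ends up ranked highest, while "orphan", which nobody links to, keeps only the base share. The score depends only on the link structure, not on page content.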

Page 44: Information Retrieval

What is More in Ranking?

What other measures can we take to rank better?
  Combining content-based methods with link-based methods
  How about learning to rank from user click-through data (apply machine learning)?
  How about learning from the social web (apply social science theories)?
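One simple way to combine the two families of methods is a weighted sum of normalized scores. This is only a sketch: the scores, the max-normalization, and the weight of 0.7 are hypothetical, and a learning-to-rank system would fit such weights from click-through data instead of fixing them by hand.

```python
def combine(content_scores, link_scores, weight=0.7):
    """Rank pages by weight * content score + (1 - weight) * link score."""
    def normalize(scores):
        # Scale scores so the best page in each list gets 1.0.
        top = max(scores.values(), default=0.0)
        return {page: (value / top if top > 0 else 0.0) for page, value in scores.items()}

    content = normalize(content_scores)
    link = normalize(link_scores)
    combined = {
        page: weight * content.get(page, 0.0) + (1.0 - weight) * link.get(page, 0.0)
        for page in set(content) | set(link)
    }
    return sorted(combined, key=lambda page: (-combined[page], page))

# Hypothetical scores for three pages.
content_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
link_scores = {"a": 0.1, "b": 0.5, "c": 0.4}
ranking = combine(content_scores, link_scores)
```

With these numbers page "b" overtakes "a": it matches the query almost as well and is much better linked. Setting `weight=1.0` falls back to pure content ranking.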

Page 45: Information Retrieval

Lots of Web Pages

How about scalability?
We have too many words; can we limit them?
  Example: Is “studying” conceptually different from “study” or “studies”? Maybe not (a concept called stemming could simplify everything to the single concept “study”)
  Stemming may not be sufficient; then how about clustering web pages into topics (e.g., the terms study, science, arts, university, school, and college would form a single concept, or topic, which may be called “education”)?
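To illustrate the stemming idea, here is a deliberately crude suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions chosen only so that the slide's example works.

```python
# Hypothetical suffix list, longest first so "ying" is tried before "ing".
SUFFIXES = ("ying", "ies", "ing", "es", "s", "y")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["Studying", "study", "studies"]
stems = {crude_stem(word) for word in variants}
```

All three variants collapse to the same stem, so the index needs one entry instead of three. A rule this crude also mangles many ordinary words, which is one reason real systems use carefully designed stemmers or lemmatizers.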

Page 46: Information Retrieval

Is it sufficient?

Can we feel confident about how a Web search engine works?
  No, it was just a summary for the day

Page 47: Information Retrieval

Guess! What would you see next?

Page 48: Information Retrieval

Our search engine
Yes, we are making it

Page 49: Information Retrieval

Appendix

Page 50: Information Retrieval

Outline

What is Research?
How to prepare yourself for IR research?

Page 51: Information Retrieval

What is Research?

Page 52: Information Retrieval

What is Research?

Research
  Discover new knowledge
  Seek answers to questions
Basic research
  Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees?)
  Often driven by curiosity (but not always)
  High impact examples: relativity theory, DNA, …
Applied research
  Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?)
  Driven by practical needs
  High impact examples: computers, transistors, vaccinations, …
The boundary is vague; distinction isn’t important

Page 53: Information Retrieval

Why Research?

[Diagram labels: Curiosity; Basic Research; Applied Research; Application Development; Amount of knowledge; Advancement of Technology; Utility of Applications]

Page 54: Information Retrieval

Where’s IR Research?

[Diagram labels: Basic Research; Applied Research; Application Development; Amount of knowledge; Advancement of Technology; Utility of Applications; Quality of Life; Information Science; Computer Science]

Page 55: Information Retrieval

Research Process

Identification of the topic (e.g., Web search)

Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art)

Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data)

Test hypothesis (e.g., compare X and Y on the data)

Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
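The hypothesis-testing step can be sketched as a small evaluation script. Everything below is made up for illustration: the relevance judgments, the two ranked runs for methods X and Y, and the choice of precision at k as the measure.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mean_precision_at_k(run, judgments, k):
    """Average precision at k over all judged queries."""
    scores = [precision_at_k(run[query], judgments[query], k) for query in judgments]
    return sum(scores) / len(scores)

# Hypothetical relevance judgments: query -> set of relevant documents.
judgments = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical ranked outputs of the two methods being compared.
run_x = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d1"]}
run_y = {"q1": ["d4", "d1", "d5"], "q2": ["d2", "d3", "d4"]}

score_x = mean_precision_at_k(run_x, judgments, k=2)
score_y = mean_precision_at_k(run_y, judgments, k=2)

# Per-query breakdown: an average can hide queries where the methods tie or flip.
per_query = {
    query: (precision_at_k(run_x[query], judgments[query], 2),
            precision_at_k(run_y[query], judgments[query], 2))
    for query in judgments
}
```

On this toy data X wins on average, but the per-query view shows it wins on only one query and ties on the other, which is the kind of "now what?" finding that sends you back to hypothesis formulation. Two queries are far too few to conclude anything in practice.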

Page 56: Information Retrieval

Typical IR Research Process

Look for a high-impact topic (basic or applied)
New problem: define/frame the problem
Identify weakness of existing solutions if any
Propose new methods
Choose data sets (often a main challenge)
Design evaluation measures (can be very difficult)
Run many experiments (need to have clear research hypotheses)
Analyze results and repeat the steps above if necessary
Publish research results

Page 57: Information Retrieval

Research Methods

Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”)

Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”)

Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)

The “E-C-E cycle”: exploratory → constructive → empirical → exploratory…

Page 58: Information Retrieval

Types of Research Questions and Results

Exploratory (Framework): What’s out there?

Descriptive (Principles): What does it look like? How does it work?

Evaluative (Empirical results): How well does a method solve a problem?

Explanatory (Causes): Why does something happen the way it happens?

Predictive (Models): What would happen if xxx ?

Page 59: Information Retrieval

Solid and High Impact Research

Solid work:
  A clear hypothesis (research question) with conclusive result (either positive or negative)
  Clearly adds to our knowledge base (what can we learn from this work?)
  Implications: a solid, focused contribution is often better than a non-conclusive broad exploration
High impact = high-importance-of-problem * high-quality-of-solution
  high impact = open up an important problem
  high impact = close a problem with the best solution
  high impact = major milestones in between
  Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best

Page 60: Information Retrieval

How to Prepare Yourself for IR Research?

Page 61: Information Retrieval

What it Takes to do Research?

Curiosity: allows you to ask questions
Critical thinking: allows you to challenge assumptions
Learning: takes you to the frontier of knowledge
Persistence: so that you don’t give up
Respect data and truth: ensures your research is solid
Communication: allows you to publish your work
…

Page 62: Information Retrieval

Learning about IR (1/2)

Start with an IR textbook (e.g., Manning et al., Grossman & Frieder, a forthcoming book from UMass,…)

Then read “Readings in IR” by Karen Sparck Jones, Peter Willett

And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf

Read other papers published in recent IR/IR-related conferences

Page 63: Information Retrieval

Learning about IR (2/2)

Getting more focused
  Choose your favorite sub-area (e.g., retrieval models)
  Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization)
Stay at the frontier:
  Keep monitoring literature in both IR and related areas
Broaden your view: Keep an eye on
  Industry activities
    Read about industry trends
    Try out novel prototype systems
  Funding trends
    Read requests for proposals

Page 64: Information Retrieval

Critical Thinking

Develop a habit of asking questions, especially why questions
Always try to make sense of what you have read/heard; don’t let any question pass by
Get used to challenging everything
Practical advice
  Question every claim made in a paper or a talk (can you argue the other way?)
  Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)
  Force yourself to challenge one point in every talk that you attend and raise a question

Page 65: Information Retrieval

Respect Data and Truth

Be honest with the experiment results
  Don’t throw away negative results!
  Try to learn from negative results
Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data
Be objective in data analysis and interpretation; don’t mislead readers
Aim at understanding/explanation instead of just good results
Be careful not to over-generalize (for both good and bad results); you may be far from the truth

Page 66: Information Retrieval

Communications

General communication skills:
  Oral and written
  Formal and informal
  Talk to people with different levels of background
Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction)
English proficiency
Get used to talking to people from different fields

Page 67: Information Retrieval

Persistence

Work only on topics that you are passionate about
Work only on hypotheses that you believe in
Don’t draw negative conclusions prematurely and give up easily
  Positive results may be hidden in negative results
  In many cases, negative results don’t completely reject a hypothesis
Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper)
Think of possibilities of repositioning a work

Page 68: Information Retrieval

Optimize Your Training

Know your strengths and weaknesses
  strong in math vs. strong in system development
  creative vs. thorough
  …
Train yourself to fix weaknesses
Find strategic partners
Position yourself to take advantage of your strengths

Page 69: Information Retrieval

Thank You

Reach me on Twitter: @matifq
Email me: [email protected]