Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) Jan. 18, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign


Page 1:

Overview of Information Retrieval

(CS598-CXZ Advanced Topics in IR Presentation)

Jan. 18, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Page 2:

What is Information Retrieval (IR)?

•Narrow sense:

– IR = search engine technologies (IR = Google, library info system)

– IR = text matching/classification

•Broad sense: IR = text information management

– General problem: how to manage text information?

– How to find useful information? (info. retrieval) (e.g., Google)

– How to organize information? (text classification) (e.g., automatically assign email to different folders)

– How to discover knowledge from text? (text mining) (e.g., discover correlation of events)

Page 3:

Why is IR Important?

•More and more online information in general (Information Overload)

•Many tasks rely on effective management and exploitation of information

•Textual information plays an important role in our lives

•Effective text management directly improves productivity

Page 4:

Elements of Text Info Management Technologies

[Figure: diagram with the labels Text; Natural Language Content Analysis; Search; Filtering; Categorization; Summarization; Clustering; Extraction; Mining; Visualization; Information Access; Information Organization; Knowledge Acquisition; Retrieval Applications; Mining Applications]

Page 5:

A Quick Tour of the State of the Art…

Page 6:

Component Technology 1: Natural Language Processing

Page 7:

What is NLP?

[Arabic text sample]

How can a computer make sense out of this string?

- What are the basic units of meaning (words)?

- What is the meaning of each word?

- How are words related with each other?

- What is the “combined meaning” of words?

- What is the “meta-meaning”? (speech act)

- Handling a large chunk of text

- Making sense of everything

[Diagram labels for the levels of analysis: Morphology, Syntax, Semantics, Pragmatics, Discourse, Inference]

Page 8:

An Example of NLP

A dog is chasing a boy on the playground

Lexical analysis (part-of-speech tagging):

A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun

Syntactic analysis (parsing):

[Parse tree with the node labels Noun Phrase, Complex Verb, Noun Phrase, Noun Phrase; Prep Phrase, Verb Phrase; Verb Phrase; Sentence]

Semantic analysis:

Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Pragmatic analysis (speech act):

A person saying this may be reminding another person to get the dog back…

Inference:

Scared(x) if Chasing(_,x,_). Combined with the facts above, this gives Scared(b1).
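
The inference step on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the slides: the tuple encoding of facts and the function name `apply_scared_rule` are assumptions made for the example, and the rule is hard-coded rather than parsed from logic syntax.

```python
# Facts from semantic analysis, encoded (by assumption) as (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def apply_scared_rule(facts):
    """Apply the single rule Scared(x) if Chasing(_, x, _) to a set of facts."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            _chaser, chased, _place = args  # only the second argument matters
            derived.add(("Scared", (chased,)))
    return derived


new_facts = apply_scared_rule(facts)
print(sorted(new_facts))
```

With the four facts from the slide, the only derived fact is Scared(b1).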

Page 9:

What we can do in NLP

A dog is chasing a boy on the playground

[Figure: the sentence with its part-of-speech tags (Det Noun Aux Verb Det Noun Prep Det Noun) and parse tree (Noun Phrase, Complex Verb, Prep Phrase, Verb Phrase, Sentence)]

•POS tagging: 97%

•Parsing: partial >90% (?)

•Semantics: some aspects

– Entity/relation extraction

– Word sense disambiguation

– Anaphora resolution

•Speech act analysis: ???

•Inference: ???

Page 10:

What We Can’t Do in NLP

•100% POS tagging

– “He turned off the highway.” vs “He turned off the fan.”

•General complete parsing

– “A man saw a boy with a telescope.”

•Deep semantic analysis

– Will we ever be able to precisely define the meaning of “own” in “John owns a restaurant.”?

Robust & general NLP tends to be “shallow” …

“Deep” understanding doesn’t scale up …
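
To make the parsing difficulty concrete, the sketch below writes out two candidate bracketings of “A man saw a boy with a telescope.” by hand. The nested-tuple encoding, the phrase labels, and the helper `leaves` are assumptions chosen for illustration; no parser is involved. The point is only that one word sequence supports two different trees.

```python
# Two hand-written bracketings, each a (label, child, child, ...) tuple.
# Reading 1: the "with" phrase attaches to the verb (the man uses the telescope).
verb_attachment = (
    "S",
    ("NP", "A", "man"),
    ("VP", ("V", "saw"), ("NP", "a", "boy"),
     ("PP", "with", ("NP", "a", "telescope"))),
)
# Reading 2: the "with" phrase attaches to the noun (the boy has the telescope).
noun_attachment = (
    "S",
    ("NP", "A", "man"),
    ("VP", ("V", "saw"),
     ("NP", ("NP", "a", "boy"), ("PP", "with", ("NP", "a", "telescope")))),
)


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase label, not a word
        words.extend(leaves(child))
    return words


same_words = leaves(verb_attachment) == leaves(noun_attachment)
different_trees = verb_attachment != noun_attachment
print(" ".join(leaves(verb_attachment)))
print(same_words, different_trees)
```

A parser that sees only the words has to choose between these structures, which usually needs knowledge beyond syntax.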

Page 11:

Component Technology 2: Search (ad hoc retrieval)

Page 12:

What is Search (Ad hoc IR)?

[Figure: a User sends the query “robotics applications” to a Retrieval System, which divides the text docs in a database/collection into relevant docs (Robotics) and non-relevant docs (others)]

Page 13:

What we can do in Search

•Search in a pure text collection is well studied

– Many different methods

– Equally effective when optimized

•Basic search techniques (e.g., vector space, prob. models) are good enough for commercialization

– All implementing TF-IDF style heuristics

– Some new models have more potential for further optimization
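
As a rough illustration of a “TF-IDF style” vector space ranker, here is a minimal sketch in plain Python. The three documents, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example; real systems differ in weighting, normalization, and indexing.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "robotics applications in manufacturing",
    "d2": "applications of machine learning to text",
    "d3": "cooking recipes for pasta",
}


def tokenize(text):
    return text.lower().split()


doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
n_docs = len(doc_tokens)
# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in doc_tokens.values() for term in set(tokens))


def idf(term):
    # One common smoothed IDF variant; terms absent from the collection get 0.
    return math.log(1 + n_docs / df[term]) if df[term] else 0.0


def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query):
    """Rank all documents by cosine similarity to the query vector."""
    q = tfidf_vector(tokenize(query))
    scores = {d: cosine(q, tfidf_vector(t)) for d, t in doc_tokens.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = search("robotics applications")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

For the query “robotics applications”, d1 ranks first because it matches both terms, d2 matches only the more common term, and d3 scores zero.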

Page 14:

What we can’t do in Search

•Basic retrieval models

– No single model is the best on all test collections

– Automatic parameter optimization

•Lack of interactive search support

•Lack of personalization

•Search context modeling

•Retrieval with more than pure text

– With structures

– Multi-media

Page 15:

Component Technology 3: Information Filtering

Page 16:

What is Information Filtering?

•Stable & long term interest, dynamic info source

•System must make a delivery decision immediately as a document “arrives”

[Figure: a Filtering System holding the user profile (“my interest:”)]
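
The filtering setting described here, a fixed interest profile and a decision made as each document arrives, might be sketched as follows. The profile terms, the overlap score, and the 0.5 threshold are assumptions for illustration; the sketch is also not adaptive, since it never updates the profile from feedback.

```python
def score(profile_terms, text):
    """Fraction of profile terms that occur in the document."""
    words = set(text.lower().split())
    return len(profile_terms & words) / len(profile_terms)


def filter_stream(profile_terms, stream, threshold=0.5):
    """Yield (document, delivered?) one document at a time, as each arrives."""
    for text in stream:
        yield text, score(profile_terms, text) >= threshold


# Hypothetical long-term interest and hypothetical arriving documents.
profile = {"information", "retrieval", "search"}
arrivals = [
    "new search engine improves information access",
    "stock markets fall sharply",
    "workshop on information retrieval evaluation",
]

decisions = list(filter_stream(profile, arrivals))
for text, delivered in decisions:
    print("DELIVER" if delivered else "skip   ", text)
```

Because `filter_stream` is a generator, each decision is made without looking at later documents, which mirrors the "decide immediately" constraint on the slide.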

Page 17:

State of the Art: Filtering

•Content-based adaptive filtering

– Basic techniques, though not perfect, are there

– We haven’t seen many (any?) filtering applications

•Collaborative filtering (recommender systems)

– Simple methods can be (are being) commercialized

– Real applications exist

– More applications are possible
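
One of the “simple methods” for collaborative filtering is user-based neighborhood prediction. The sketch below is a minimal version under stated assumptions: the ratings table is invented, similarity is cosine over co-rated items, and the prediction is a similarity-weighted average. It is not the specific method of any system mentioned in the slides.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "cat": {"A": 1, "C": 5, "D": 2},
}


def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [
        (similarity(user, other), their_ratings[item])
        for other, their_ratings in ratings.items()
        if other != user and item in their_ratings
    ]
    total = sum(sim for sim, _ in pairs)
    if not total:
        return None  # nobody comparable has rated this item
    return sum(sim * rating for sim, rating in pairs) / total


prediction = predict("ann", "D")
print(prediction)
```

Here ann's tastes are closer to bob's than to cat's, so the predicted rating for item D lands nearer bob's 5 than cat's 2.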

Page 18:

Component Technology 4: Text Categorization

Page 19:

What is Text Categorization?

•Pre-given categories and labeled document examples (Categories may form hierarchy)

•Classify new documents

•A standard supervised learning problem

[Figure: a Categorization System assigning documents to categories such as Sports, Business, Education, Science, …]
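
Since categorization is framed as standard supervised learning, a small multinomial Naive Bayes classifier gives a feel for the setup: labeled examples in, a label for a new document out. The training sentences and the choice of Naive Bayes with add-one smoothing are assumptions for this sketch; the slides do not prescribe a particular learner here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples.
training = [
    ("the team won the game", "Sports"),
    ("players scored in the final match", "Sports"),
    ("the company reported quarterly profits", "Business"),
    ("stock prices and market profits rose", "Business"),
]


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior under Naive Bayes."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
label = classify("market profits fell", model)
print(label)
```

The same `train`/`classify` interface would apply to any other supervised learner; only the scoring inside `classify` changes.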

Page 20:

State of the Art: Categorization

•Many supervised learning methods have been developed

– SVM is often the best in performance

– Other methods are also competitive

– Commercial applications exist, but not at a large-scale

– More applications can be developed

•Feature selection/extraction is often more important than the choice of the learning algorithm

•Applications have been developed

•Relatively well explored

Page 21:

Component Technology 5: Clustering

Page 22:

The Clustering Problem

•Discover “natural structure”

•Group similar objects together

•Objects can be documents, terms, or passages

•Example
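
One possible way to “group similar objects together” is single-link agglomerative clustering over word overlap. The sketch below is only one of many options: the four short texts, the Jaccard similarity, and the 0.2 stopping threshold are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two non-empty sets of words."""
    return len(a & b) / len(a | b)


def cluster(texts, threshold=0.2):
    """Single-link agglomerative clustering: repeatedly merge the two most
    similar clusters until no pair reaches the threshold."""
    word_sets = [set(t.lower().split()) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one cluster per text

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(word_sets[i], word_sets[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = [
            (link(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        best, a, b = max(pairs)
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical documents: two about search, two about gene expression.
texts = [
    "search engine ranking",
    "web search engine",
    "gene expression data",
    "gene expression analysis",
]

groups = cluster(texts)
print(groups)
```

The result is a list of clusters, each a list of indices into `texts`; changing the similarity measure or the linkage rule changes the clustering bias.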

Page 23:

State of the Art: Clustering

•Many methods have been developed, applicable in different situations

•Difficult to predict which method is the best

•When patterns are clear, most methods work well

•In difficult situations

– Special clustering bias must be incorporated

– Properties of clustering methods need to be considered

Page 24:

End of State of the Art Tour…

Page 25:

Where is IR Going?

•IR and related areas

•Current trends

•How would this course fit into the picture?

Page 26:

Related Areas

[Figure: map of Information Retrieval and its related areas: Databases; Library & Info Science; Machine Learning, Pattern Recognition; Data Mining; Natural Language Processing; Applications (Web, Bioinformatics…); Statistics, Optimization; Software engineering, Computer systems. Region labels: Models, Algorithms, Applications, Systems]

Page 27:

Current Trends

[Figure: the Related Areas map (Information Retrieval; Databases; Library & Info Science; Machine Learning, Pattern Recognition; Data Mining; Natural Language Processing; Applications; Statistics, Optimization; Software engineering, Computer systems; regions Models, Algorithms, Applications, Systems) with these trends added:]

•Web/Bioinformatics/…

•Literature/Digital Library

•Structured + Unstructured Data

•Human-Computer Interactions

•High-Performance Computing

•More Powerful Content Analysis

•More Principled Models/Algorithms

Page 28:

Publications/Societies

[Figure: venues placed on the Related Areas map, listed here by area]

•Info Retrieval: ACM SIGIR, ACM CIKM, TREC

•Databases: ACM SIGMOD, VLDB, PODS, ICDE

•Info. Science: ASIS, JCDL

•Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD

•NLP: ACL, COLING, EMNLP, ANLP, HLT

•Applications: ISMB, RECOMB, PSB, WWW

•Statistics: ??

•Software/systems: ??

Page 29:

Let Users Lead the Way…

•The underlying driving force has always been real world applications

•The ultimate impact of research in IR is to benefit people in accessing and using information in the real world

•Research on many component technologies is reaching a stage of “diminishing return”; the challenge is how to make use of such imperfect techniques

•Think more about complete solutions (as opposed to component technologies) as well as new applications

Page 30:

How would this Course Fit into the Picture?

•Identify novel application problems

•Identify new research topics

•Examine existing research work in these directions

•Design and carry out new projects in some of the directions

•We will broadly look at 3 application domains: Web, Email, and Literature