Special Topics in Computer Science: The Art of Information Retrieval
Chapter 2: Modeling
Alexander Gelbukh, www.Gelbukh.com

Dec 31, 2015


lynn-vang

Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 2: Modeling

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 2: Modeling

Alexander Gelbukh

www.Gelbukh.com

Page 2

Previous chapter

User Information Need
o Vague
o Semantic, not formal

Document Relevance
o Order, not retrieve

Huge amount of information
o Efficiency concerns
o Tradeoffs

Art more than science

Page 3

Modeling

Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model

Why math if the model is not precise (simplified)?

phenomenon ≈ model = step 1 = step 2 = ... = result (math)

phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!

Page 4

Modeling in IR: idea

Tag documents with fields
o As in a (relational) DB: customer = {name, age, address}
o Unlike a DB, very many fields: individual words!
o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}

Define a similarity measure between the query and such a record
o Unlike a DB: order, not retrieve (yes/no)
o Justify your model (optional, but nice)

Develop math and algorithms for fast access
o As relational algebra in a DB
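The bag-of-words record can be illustrated with a minimal sketch. The vocabulary, the sample text, and the function name `bag_of_words` are assumptions made up for this example; the text is chosen so that the counts reproduce the {3, 5, 0, 0, 2} pattern from the slide.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["cat", "dog", "mouse", "bird", "fish"]
doc = "cat dog cat fish dog cat dog dog dog fish"

vector = bag_of_words(doc, vocabulary)
print(vector)
```

Word order is discarded on purpose: only the per-word counts survive, which is what makes the record behave like a row with one field per word.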

Page 5

Taxonomy of IR systems

Page 6

Aspects of an IR system

IR model
o Boolean, Vector, Probabilistic

Logical view of documents
o Full text, bag of words, ...

User task
o Retrieval, browsing

These aspects are independent, though some combinations are more compatible

Page 7

Taxonomy of IR models

Boolean (set theoretic)
o fuzzy
o extended

Vector (algebraic)
o generalized vector
o latent semantic indexing
o neural network

Probabilistic
o inference network
o belief network

Page 8

Taxonomy of other aspects

Text structure
o Non-overlapping lists
o Proximal nodes model

Browsing
o Flat
o Structure guided
o Hypertext

Page 9

Appropriate models

Alexander Gelbukh
I did not understand the last column and last row. What is this picture about?
Page 10

Retrieval operation mode

Ad-hoc
o static documents
o interactive
o ordered

Filtering (ad-hoc on new docs)
o changing document collection: notification
o not interactive: machine learning techniques can be used
o yes/no

Page 11

Characterization of an IR model

D = {dj}, collection of formal representations of docs
o e.g., keyword vectors

Q = {qi}, possible formal representations of user information need (queries)

F, framework for modeling these two: it gives the reason for the ranking function below

R(qi, dj): Q × D → ℝ, ranking function
o defines ordering

Page 12

Specific IR models

Page 13

IR models

Classical
o Boolean
o Vector
o Probabilistic
(clear ideas, but some disadvantages)

Refined
o Each one with refinements
o Solve many of the problems of the “basic” models
o Give good examples of possible developments in the area
o Not investigated well: we can work on this

Page 14

Basic notions

Document: set of index terms
o Mainly nouns
o Maybe all words; then it is the full text logical view

Term weights
o some terms are better than others
o terms less frequent in this doc and more frequent in other docs are less useful

Document's index term vector {w1j, w2j, ..., wtj}
o weights of terms in the doc
o t is the number of terms in all docs
o weights of different terms are independent (simplification)

Page 15

Boolean model

Weights ∈ {0, 1}
o Doc: set of words

Query: Boolean expression
o R(qi, dj) ∈ {0, 1}

Good:
o clear semantics, neat formalism, simple

Bad:
o no ranking (data retrieval), retrieves too many or too few
o difficult to translate User Information Need into query

No term weighting
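A minimal sketch of Boolean matching, assuming queries are written as nested tuples. The tuple encoding, the function name `matches`, and the sample documents are all made up for illustration. Each document is a set of words and the result is 0 or 1, so there is nothing to rank by.

```python
def matches(query, doc_terms):
    """Evaluate a Boolean query against a document given as a set of terms.

    A query is either a term (str) or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q).
    Returns 1 if the document satisfies the query, else 0.
    """
    if isinstance(query, str):
        return int(query in doc_terms)
    op, *args = query
    if op == "AND":
        return int(all(matches(a, doc_terms) for a in args))
    if op == "OR":
        return int(any(matches(a, doc_terms) for a in args))
    if op == "NOT":
        return 1 - matches(args[0], doc_terms)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical collection, for illustration only.
docs = {
    "d1": {"mouse", "cheese"},
    "d2": {"mouse", "cat"},
    "d3": {"dog"},
}
query = ("AND", "mouse", ("NOT", "cat"))
result = {name: matches(query, terms) for name, terms in docs.items()}
print(result)
```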

Page 16

Vector model

Weights (non-binary)
Ranking, much better results (for User Info Need)
R(qi, dj) = correlation between query vector and doc vector
E.g., cosine measure: (there is a typo in the book)
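The formula itself is not preserved in this transcript. The standard cosine measure is the dot product of the two weight vectors divided by the product of their lengths. A minimal sketch follows, with made-up vectors; returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical query and document vectors over a 3-term vocabulary.
query = [1, 1, 0]
docs = {"d1": [3, 5, 0], "d2": [0, 0, 2], "d3": [2, 0, 2]}

ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranking)
```

Unlike the Boolean case, a document sharing only some query terms still gets a non-zero score, which is what allows partial matching and ordering.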

Page 17

Projection

Page 18

Weights

How are the weights wij obtained? Many variants.

One way: TF-IDF balance

TF: Term frequency
o How well is the term related to the doc?
o If it appears many times, it is important
o Proportional to the number of times it appears

IDF: Inverse document frequency
o How important is the term for distinguishing documents?
o If it appears in many docs, it is not important
o Inversely proportional to the number of docs where it appears

Contradictory. How to balance?

Page 19

TF-IDF ranking

TF: Term frequency

IDF: Inverse document frequency

Balance: TF × IDF
o Other formulas exist. Art.
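The slide's own formulas are not preserved in this transcript, so the sketch below assumes one common formulation: TF is the term count divided by the count of the most frequent term in the document, and IDF is log(N / n), where N is the number of documents and n the number containing the term. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed formulation (not taken from the slide):
      tf  = count of term in doc / count of the doc's most frequent term
      idf = log(N / n), N = number of docs, n = docs containing the term
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical collection, for illustration only.
docs = [
    "cat cat mouse".split(),
    "cat dog".split(),
    "cat bird bird".split(),
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In this collection "cat" occurs in every document, so its IDF is log(1) = 0 and it gets zero weight everywhere, however often it appears.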

Page 20

Advantages of vector model

One of the best known strategies
Improves quality (term weighting)
Allows approximate matching (partial matching)
Gives ranking by similarity (cosine formula)
Simple, fast

But:
Does not consider term dependencies
o considering them in a bad way hurts quality
o no known good way
No logical expressions (e.g., negation: “mouse & NOT cat”)

Page 21

Probabilistic model

Assumptions:
o set of “relevant” docs,
o probabilities of docs to be relevant
o After Bayes calculation: probabilities of terms to be important for defining relevant docs

Initial idea: interact with the user.
o Generate an initial set
o Ask the user to mark some of them as relevant or not
o Estimate the probabilities of keywords. Repeat

Can be done without user
o Just re-calculate the probabilities assuming the user’s acceptance is the same as predicted ranking
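The user-free loop can be sketched as follows. This is an illustration under stated assumptions, not the slide's own formulas: it uses a binary independence scoring (sum of term log-odds), treats the top `top_r` ranked docs as if the user had marked them relevant, and smooths the estimates with 0.5 so no probability reaches 0 or 1. The names `rank`, `probabilistic_ranking`, `top_r`, and `rounds` are made up for this example.

```python
import math

def rank(query, docs, p_rel, p_nonrel):
    """Order doc indices by the sum of term log-odds over shared query terms."""
    def score(doc):
        return sum(
            math.log(p_rel[t] / (1 - p_rel[t]))
            + math.log((1 - p_nonrel[t]) / p_nonrel[t])
            for t in query if t in doc
        )
    return sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)

def probabilistic_ranking(query, docs, top_r=2, rounds=3):
    """Iteratively re-estimate term probabilities without user feedback."""
    n = len(docs)
    doc_freq = {t: sum(t in d for d in docs) for t in query}
    # Initial guess: nothing known about relevant docs; non-relevant ~ whole collection.
    p_rel = {t: 0.5 for t in query}
    p_nonrel = {t: (doc_freq[t] + 0.5) / (n + 1) for t in query}
    order = rank(query, docs, p_rel, p_nonrel)
    for _ in range(rounds):
        top = [docs[i] for i in order[:top_r]]  # assumed relevant
        v = len(top)
        for t in query:
            v_t = sum(t in d for d in top)
            p_rel[t] = (v_t + 0.5) / (v + 1)
            p_nonrel[t] = (doc_freq[t] - v_t + 0.5) / (n - v + 1)
        order = rank(query, docs, p_rel, p_nonrel)
    return order

# Hypothetical collection: each doc is a set of terms (binary weights).
docs = [{"ir", "model"}, {"ir"}, {"cat"}, {"dog"}]
order = probabilistic_ranking(["ir", "model"], docs)
print(order)
```

The initial values play the role of the guessed initial ranking, and the documents are plain sets, so term frequencies are ignored.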

Page 22

(Dis)advantages of Probabilistic model

Advantage:
o Theoretical adequacy: ranks by probabilities

Disadvantages:
o Need to guess the initial ranking
o Binary weights, ignores frequencies
o Independence assumption (not clear if bad)

Does not perform well (?)

Page 23

Alternative Set Theoretic models
Fuzzy set model

Takes into account term relationships (thesaurus)
o Bible is related to Church

Fuzzy belonging of a term to a document
o Document containing Bible also contains “a little bit of” Church, but not entirely

Fuzzy set logic applied to such fuzzy belonging
o logical expressions with AND, OR, and NOT

Provides ranking, not just yes/no
Not investigated well.
o Why not investigate it?

Alexander Gelbukh
Does it really support NOT?
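One way to make the “a little bit of Church” idea concrete is sketched below. The formulas are assumptions, since the slide gives none: term relatedness is estimated from co-occurrence counts, a document's membership in a term's fuzzy set combines the relatedness of all its terms, and AND, OR, NOT use the standard fuzzy connectives min, max, and complement. All names and the sample collection are made up for illustration.

```python
def correlation(term_a, term_b, docs):
    """Co-occurrence correlation between two terms over a collection of term sets."""
    n_a = sum(term_a in d for d in docs)
    n_b = sum(term_b in d for d in docs)
    n_ab = sum(term_a in d and term_b in d for d in docs)
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom else 0.0

def membership(term, doc, docs):
    """Degree in [0, 1] to which doc belongs to the fuzzy set of term."""
    product = 1.0
    for other in doc:
        product *= 1.0 - correlation(term, other, docs)
    return 1.0 - product

# Standard fuzzy connectives (assumed here; the slide does not give formulas).
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# Hypothetical collection, for illustration only.
docs = [{"bible", "church"}, {"bible"}, {"church", "choir"}, {"football"}]
for d in docs:
    print(sorted(d), round(membership("church", d, docs), 3))
```

Under these assumptions the document that contains only "bible" belongs to the "church" set to a partial degree, and NOT is simply the complement of a membership degree.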
Page 24

Alternative Set Theoretic models
Extended Boolean model

Combination of Boolean and Vector
In comparison with Boolean model, adds “distance from query”
o some documents satisfy the query better than others

In comparison with Vector model, adds the distinction between AND and OR combinations

There is a parameter (degree of norm) that adjusts the behavior between Boolean-like and Vector-like

This parameter can even differ within one query
Not investigated well. Why not investigate it?
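The role of the norm degree can be sketched with the p-norm formulas usually associated with the extended Boolean model. The slide does not show them, so treat the formulas, the function names, and the sample weights as assumptions. Term weights are taken to lie in [0, 1].

```python
def or_score(weights, p):
    """p-norm OR: distance from the all-zero point, normalized to [0, 1]."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def and_score(weights, p):
    """p-norm AND: one minus the normalized distance from the all-one point."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)

# Hypothetical doc that contains the first query term but not the second.
doc = [1.0, 0.0]
for p in (1, 2, 10):
    print(p, round(or_score(doc, p), 3), round(and_score(doc, p), 3))
```

With p = 1 both scores equal the plain average of the weights, so AND and OR are indistinguishable, as in the vector model. As p grows, OR approaches the maximum weight and AND the minimum, which is the Boolean-like end.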

Page 25

Alternative Algebraic models
Generalized Vector Space model

Classical independence assumptions:
o All combinations of terms are possible, none are equivalent (= basis in the vector space)
o Pair-wise orthogonal: cos({ki}, {kj}) = 0

This model relaxes the pair-wise orthogonality: cos({ki}, {kj}) ≠ 0

Operates by combinations (co-occurrences) of index terms, not individual terms

More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?

Page 26

Alternative Algebraic models
Latent Semantic Indexing model

Index by larger units, “concepts”: sets of terms used together

Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms)

Group index terms together (map into lower dimensional space). So some terms are equivalent.
o Not exactly, but this is the idea
o Eliminates unimportant details
o Depends on a parameter (what details are unimportant?)

Not investigated well. Why not investigate it?
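The mapping into a lower dimensional space is commonly done with a truncated singular value decomposition of the term-document matrix. A sketch in the usual notation follows; the symbols M, K, S, D and the cutoff s are assumptions, since the slide names none of them.

```latex
M = K S D^{T}, \qquad
M_s = K_s S_s D_s^{T}, \quad s < \operatorname{rank}(M)
```

Here M is the term-document matrix, S is the diagonal matrix of singular values in decreasing order, and M_s keeps only the s largest of them. The cutoff s plays the role of the parameter that decides which details count as unimportant.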

Page 27

Alternative Algebraic models
Neural Network model

NNs are good at matching
Iteratively uses the found documents as auxiliary queries
o Spreading activation.
o Terms → docs → terms → docs → terms → docs → ...

Like a built-in thesaurus
First round gives same result as Vector model
No evidence if it is good
Not investigated well. Why not investigate it?

Page 28

Alternative Probabilistic models
Bayesian Inference Network model

(One of the authors of the book worked on this. In fact not so important)

Probability as belief (not as frequency)
o Belief in importance of terms. Query terms have 1.0

Similar to Neural Net
o Documents found increase the importance of their terms
o Thus act as new queries
o But different propagation formulas

Flexible in combining sources of evidence
Can be applied to different ranking strategies (Boolean or TF-IDF)
Good quality of results (Warning! Authors work on this)

Page 29
Page 30

Alternative Probabilistic models
Belief Network model

(Introduced by one of the authors of the book.)

Better network topology
o Separation of document and term space
o More general than Inference model

--------------------------------------------------------------------

Bayesian network models:
o Do not include cycles and thus have linear complexity, unlike Neural Nets
o Combine distinct evidence sources (also user feedback)
o Are a neat formalism.
o Better alternative to combinations of Boolean and Vector

Page 31

Models for structured text

Cat in the 3rd chapter. Cat in the same paragraph as Dog

Non-overlapping lists
o Chapters, sections, paragraphs: as regions
o Technically treated much like terms (ranges of positions)
o Sections containing Cat

Proximal nodes model (suggested by the authors)
o Chapters, sections, paragraphs: as objects (nodes)

Page 32

Models for browsing

Flat browsing
o Just as a list of papers
o No context cues provided

Structure guided
o Hierarchy
o Like directory tree in the computer

Hypertext (Internet!)
o No limitations of sequential writing
o Modeled by a directed graph: links from unit A to unit B (units: docs, chapters, etc.)
o A map (with traversed path) can be helpful

Page 33

The Web

Internet: not hypertext
o Authors call “hypertext” a well-organized hypertext
o Internet: not depository but heap of information

Page 34

Research issues

How do people judge relevance?
o ranking strategies

How to combine different sources of evidence?

What interfaces can help users to understand and formulate their Information Need?
o user interfaces: an open issue

Meta-search engines: combine results from different Web search engines
o Their results almost do not intersect
o How to combine ranking?

Page 35

Conclusions

Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
o TF-IDF term weighting
o This (or similar) weighting is used in all further models

Many interesting and not well-investigated variations
o possible future work

Page 36

Thank you!

Till October 2