Information Retrieval to Knowledge Retrieval, one more step


Xiaozhong Liu, Assistant Professor

School of Library and Information Science, Indiana University Bloomington

What is Information?

What is Retrieval?

What is Information Retrieval?

I am Retriever

How to find this book in Library?

Search something based on User Information Need!!

How to express your information need?

Query

User Information Need!!

Query

What is a good query?
What is a bad query?

Good query: query ≈ information need
Bad query: query ≠ information need

Wait!!! The user NEVER makes a mistake!!! It’s OUR job!!!

Task 1: Given a user information need, how can we help the user propose a better query, or generate one automatically?

If there is a query…

Perfect query: Query_optimize

User input query: Query_user


Query Results
Given a query, how do we retrieve results?

What are good results?
What are bad results?

Task 2: Given a query (not perfect), how do we retrieve documents from the collection?

Very large, unstructured text data!!!

F(query, doc)

Can you give me an example?

F(query, doc)

If a query term exists in doc: Yes, this is a result

If no query term exists in doc: No, this is not a result

Is there any problem in this function?Brainstorm…

Query: Obama’s wife

Doc 1. My wife supports Obama’s new policy on…

Doc 2. Michelle, as the first lady of the United States…
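A minimal sketch of the naive F(query, doc) described above, applied to the two example documents. The tokenizer (lowercasing, dropping a possessive 's) and the function names are assumptions made for illustration, not part of the slides.

```python
import re

def tokenize(text):
    # Lowercase, drop a possessive 's (straight or curly apostrophe), keep letter runs.
    text = re.sub(r"[’']s\b", "", text.lower())
    return re.findall(r"[a-z]+", text)

def naive_match(query, doc):
    # The naive rule: a document is a result if any query term occurs in it.
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

docs = {
    "Doc 1": "My wife supports Obama’s new policy on…",
    "Doc 2": "Michelle, as the first lady of the United States…",
}
query = "Obama’s wife"
results = {name: naive_match(query, text) for name, text in docs.items()}
print(results)
```

Under these assumptions Doc 1 is returned because it shares surface terms with the query, while Doc 2, the document that is actually about the information need, is missed.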

Yes, this is a very challenging task!

Another problem
Collection size: 5 billion
Matching docs: 5

My algorithm successfully finds all 5 docs! In… 3 billion results…


Query Results

How do we help the user find what they need among all the retrieved results?

Task 3: Given the retrieved results, how do we help users find the ones they need?

If the retrieval algorithm retrieved 1 billion results from the collection, what would you do???

Search with Google, click “next”???

Yes, we can help user find what they need!

Query: Indiana University Bloomington

Can you read them one by one?

You use it??


Query Results

[Diagram: User and System connected by Query and Results, with steps 1, 2 and 3 marked]

They are not independent!

Information Retrieval

Text

Image

Music

Map

……


document

web

scholar

blog

news

Index

Documents vs. Database Records
• Relational database records are typically made up of well-defined fields

Select * from students where GPA > 2.5

We need a more effective way to index the text!

Text, similar way? Find all the docs including “Xiaozhong”

Select * from documents where text like '%xiaozhong%'

Collection C: doc1, doc2, doc3 ……… docN

Query q: q1, q2, q3 ……… qt where qx is the query term

Document doci : di1, di2, di3 ……… dim All dij ∈ V

Vocabulary V: w1, w2, w3 ……… wn


V: w1, w2, w3 ……… wn

Doc1 1 0 0 1

Doc2 0 0 0 1

Doc3 1 1 1 1

DocN 1 0 1 1

………

Query q: 0, 1, 0 ………


V: w1, w2, w3 ……… wn

Doc1 3 0 0 9

Doc2 0 0 0 7

Doc3 2 11 21 1

DocN 7 0 1 2

………

Query q: 0, 3, 0 ………

Normalization is very important!
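The binary and count matrices above can be sketched as follows. The toy collection, whitespace tokenization and helper names are assumptions for illustration only.

```python
def build_vocabulary(docs):
    # Sorted list of distinct tokens across the collection (the vocabulary V).
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def to_vector(text, vocab, binary=True):
    # One entry per vocabulary word: presence (0/1) or raw count.
    tokens = text.lower().split()
    if binary:
        present = set(tokens)
        return [1 if w in present else 0 for w in vocab]
    return [tokens.count(w) for w in vocab]

docs = ["cat dog cat", "cat dog", "snake"]  # hypothetical toy collection
vocab = build_vocabulary(docs)
binary_matrix = [to_vector(d, vocab) for d in docs]
count_matrix = [to_vector(d, vocab, binary=False) for d in docs]
query_vector = to_vector("dog", vocab)

for name, row in zip(["Doc1", "Doc2", "Doc3"], count_matrix):
    print(name, row)
```

Raw counts favor long documents, which is one reason the rows are normalized before documents are compared.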


V: w1, w2, w3 ……… wn

Doc1 0.41 0 0 0.62

Doc2 0 0 0 0.12

Doc3 0.42 0.11 0.34 0.13

DocN 0.01 0 0.19 0.24

………

Query q: 0, 0.37, 0 ………

Normalization is very important!

Weight

Term weighting

TF * IDF

Term frequency: freq(w, doc) / |doc|
Or…

Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: total number of docs containing word w

An effective way to weight each word in a document
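A small sketch of the weighting above, using TF = freq(w, doc) / |doc| and IDF = 1 + log(N/k). The natural logarithm, the toy documents and the function name are assumptions; the slides do not fix a log base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # k in the slide: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (freq / len(doc)) * (1 + math.log(n_docs / doc_freq[term]))
            for term, freq in counts.items()
        })
    return weights

docs = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]  # hypothetical collection
weights = tf_idf(docs)
for i, w in enumerate(weights, start=1):
    print(f"Doc{i}", {t: round(v, 3) for t, v in w.items()})
```

A word that is frequent in one document but rare in the collection ("snake" here) receives the largest weight.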

Index

Space?

Speed?

Retrieval Model?

Ranking?

Semantic?

Document representation meets the requirement of retrieval system

Stemming

Education
Educational
Educate
Educating
Educations

→ Educat

Very effective to improve system performance.

Some risk! E.g. LA Lakers = LA Lake?

Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.

Inverted index

I love my cat this is lovely yellow and white

i love cat thi yellow white

i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3

We lose something?


Inverted index

i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3

With term frequency (doc:freq):

i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1

We still lose something?


Inverted index

i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1

With positions (doc:position):

i – 1:1
love – 1:2, 2:4
thi – 2:1
cat – 1:4, 2:2, 3:2, 3:5
yellow – 3:1
white – 3:4
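A sketch of how a positional inverted index like the one above could be built for the three cat documents. The stop list and the two-entry stem table are toy stand-ins (assumptions) for a real stop-word list and stemmer; positions are 1-based and counted before stop words are removed.

```python
import re
from collections import defaultdict

STOP_WORDS = {"my", "is", "and"}            # assumed stop list
STEMS = {"lovely": "love", "this": "thi"}   # toy stand-in for a real stemmer

def build_positional_index(docs):
    # term -> doc id -> list of 1-based token positions
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for position, token in enumerate(tokens, start=1):
            if token in STOP_WORDS:
                continue
            index[STEMS.get(token, token)][doc_id].append(position)
    return index

docs = {
    1: "I love my cat.",
    2: "This cat is lovely!",
    3: "Yellow cat and white cat.",
}
index = build_positional_index(docs)
for term, postings in index.items():
    print(term, dict(postings))
```

The document list per term is the basic index, the length of each position list gives the term frequency, and the positions themselves support phrase and proximity matching.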

Why do you need position info?

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

Proximity of query terms
query: information retrieval

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

Index – bag of words
query: information retrieval

What’s the limitation of bag-of-words? Can we make it better?

n-gram:

Doc 1: information retrieval, retrieval is, is important, important for ……

bi-gram
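The bi-gram list for Doc 1 can be produced with a short helper. This is a sketch; the function name and whitespace tokenization are assumptions.

```python
def ngrams(tokens, n=2):
    # Slide a window of n tokens over the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "information retrieval is important for digital library".split()
bigrams = ngrams(doc1, 2)
print(bigrams)
```

Indexing bigrams keeps "information retrieval" as a single unit, at the cost of a much larger vocabulary.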

Better semantic representation!
What’s the limitation?

Doc 1: …… big apple ……

Doc 2: …… apple……

Index – bag of “phrase”?

More precise, less ambiguous

How to identify phrases from documents?

• Identify syntactic phrases using POS tagging
• n-grams
• from existing resources

Noise detection

What is the noise of web page? Non-informative content…

Web Crawler - freshness

Web is changing, but we cannot constantly check all the pages…

Need to find the most important pages that change frequently

www.nba.com

www.iub.edu

www.restaurant????.com

Sitemap: a list of URLs for each host, with modification time and change frequency

Retrieval

Model

Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world (Hiemstra, 2001).

e.g., some models help you predict tomorrow's stock price…

Hypothesis:

Retrieval and ranking problem = Similarity Problem!

Vector Space Model

Is that a good hypothesis? Why?

Retrieval Function: Similarity (query, Document)

Return a score!!! We can Rank the documents!!!

So, query is a short document

Vector Space Model


V: w1, w2, w3 ……… wn

Doc1 0.41 0 0 0.62

Doc2 0 0 0 0.12

Doc3 0.42 0.11 0.34 0.13

DocN 0.01 0 0.19 0.24

………

Query q: 0, 0.37, 0 ………



Similarity

Doc Vector

Query Vector

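Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……

Query: dog cat cat

[Figure: doc 1, doc 2 and doc 3 plotted as vectors in term space; dog axis with ticks 1 and 2]

Query: dog cat

F (q, doc) = cosine similarity (q, doc)

[Figure: the same plot on cat and dog axes (ticks 1 and 2); doc 2 = query; θ marks the angle between two vectors]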

Why Cosine?
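A sketch of F(q, doc) = cosine similarity for the cat/dog example, with raw term counts on (cat, dog) axes. Returning 0 for an all-zero vector (doc 3 contains neither term) is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: no direction, no similarity
    return dot / (norm_u * norm_v)

# Axes: (cat, dog), raw term counts
doc_vectors = {"doc 1": (2, 1), "doc 2": (1, 1), "doc 3": (0, 0)}
query = (1, 1)  # "dog cat"

scores = {name: cosine_similarity(query, vec) for name, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Cosine depends only on the angle θ between the vectors, so a document that repeats the same words ten times points in the same direction and scores the same as the shorter one.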

Vector Space Model

Dimension = n = vocabulary size

Query q: q1, q2, q3 ……… qn
Same dimensional space!!!
Document doci : di1, di2, di3 ……… din

Vocabulary V: w1, w2, w3 ……… wn


Query: dog cat

Try!

Term weighting

Doc [ 0.42 0.11 0.34 0.13 ]

weight, how?

TF * IDF

Term frequency: freq(w, doc) / |doc|
Or…

Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: total number of docs containing word w

More TF

Weighting is very important for the retrieval model!
We can improve TF by…

e.g., replacing freq(term, doc) with log[freq(term, doc)]

BM25:
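For reference, a sketch of one widely used formulation of BM25 (Okapi BM25 with a smoothed IDF). It may differ from the variant intended on this slide; the parameter defaults k1 = 1.2 and b = 0.75 and the toy documents are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """query: list of terms; docs: list of token lists. Returns one score per doc."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            # Smoothed IDF; the +1 inside the log keeps it positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Document length normalization, controlled by b.
            length_norm = 1 - b + b * len(doc) / avg_len
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]  # hypothetical collection
scores = bm25_scores(["snake"], docs)
print(scores)
```

The three ingredients are the same as before (TF, IDF, length normalization), but TF saturates instead of growing linearly.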

Vector Space Model

But…

Bag-of-words assumption = words are independent!

Query = Document, maybe not true!

Vector and SEO (Search Engine Optimization)…

Synonyms? Semantically related words?

How about these…

Pivoted Normalization Method

Dirichlet Prior Method

TF, IDF, Normalization

+ parameter

Language model

Probability distribution over words

P (I love you) = 0.01
P (you love I) = 0.00001
P (love you I) = 0.0000001

If we have this information… we could build a generative model!

P(text | θ)

Language model - unigram

Generate text with bag-of-word assumption (word independent):

P (w1, w2,…wn) = P(w1) P(w2)…P(wn)
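A sketch of the unigram factorization above: estimate P(w) from counts, then multiply. The training text and function names are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens):
    # Maximum-likelihood estimate: P(w) = count(w) / total tokens.
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def sequence_probability(model, words):
    # P(w1, w2, ..., wn) = P(w1) P(w2) ... P(wn); unseen words get probability 0.
    return math.prod(model.get(w, 0.0) for w in words)

model = train_unigram("i love you you love cats".split())  # hypothetical training text
p1 = sequence_probability(model, ["i", "love", "you"])
p2 = sequence_probability(model, ["you", "love", "i"])
print(p1, p2)
```

Because the words are treated as independent, the unigram model assigns the same probability to every ordering of the same words; that is the price of the bag-of-words assumption.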

food orange desk USB computer Apple Unix …. …. …. milk sport superbowl

topic X = ???

food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB

topic 1

topic 2

Doc: I’m using Mac computer… remote access another computer… share some USB device…

P(Doc | topic1) vs. P(Doc | topic2)

king ghost hamlet play …. …. romeo juliet iPad iPhone 4s tv apple …… play store

food orange desk USB computer Apple Unix …. …. …. milk sport superbowl

topicX

How to estimate???

If we have enough data, i.e. docs about topic X

10/10000 1000/10000 30/10000

P(“computer” | topic X)

food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB

doc 1

doc 2

query: sport game watch

P(query | doc 1) vs. P(query | doc 2)

a document doc:

query likelihood query term likelihood

Retrieval problem → query likelihood → term likelihood P(qi | doc)

But document is a small sample of topic… Data like:

Smoothing!

P(qi | doc): what if qi is not observed in doc? P(qi | doc) = 0?

We want to give this a non-zero score!!!

Smoothing

i.e.

We can make it better!

Smoothing

First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).

Second, smoothing helps to model the background (non-discriminative) words in the query.

Improve language model estimation by using Smoothing

Smoothing

Another smoothing method:

P (w | θ)

if the word exists in doc

if the word does not exist in doc

P (w | doc)

P (w | collection) Collection Language Model

P (w | θ) = (1-λ) ∙ P(w | θdoc) + λ ∙ P(w | θcollection)
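A sketch of query likelihood with linear interpolation smoothing, where each term probability is (1 − λ)·P(w | doc) + λ·P(w | collection). The toy documents, λ = 0.5 and the function name are assumptions.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(query | doc) under a unigram model smoothed with the collection model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)                # P(w | doc)
        p_coll = coll_counts[term] / len(collection)       # P(w | collection)
        score *= (1 - lam) * p_doc + lam * p_coll
    return score

doc1 = "sport game watch nba score".split()      # hypothetical documents
doc2 = "food orange milk yogurt sport".split()
collection = doc1 + doc2
query = "sport game watch".split()

s1 = query_likelihood(query, doc1, collection)
s2 = query_likelihood(query, doc2, collection)
unsmoothed = query_likelihood(query, doc2, collection, lam=0.0)
print(s1, s2, unsmoothed)
```

With λ = 0 a single unseen query term drives the whole score to zero; with smoothing doc 2 still gets a small non-zero score and can be ranked.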

Smoothing

We could use collection language model:

TFIDF is closely related to Language Model and other retrieval models

Term Freq

IDF

Doc length norm

Language model

Solid statistical foundation

Flexible parameter setting

Different smoothing method

Language model in library?

If we have a paper… and a query…

Similarity (paper, query) Vector Space Model

If query word not in the paper…

Score = 0

If we use language model…

Language model in library?

Likelihood of query given a paper can be estimated by:

P(query | paper, author, journal, ……) = α P(query | paper) + β P(query | author) + γ P(query | journal) + ……

Likelihood of query given a paper & author & journal & ……

e.g. what’s the difference between web and doc retrieval???

F (doc, query)

F (web page, query)

vs

web page = doc + hyperlink + domain info + anchor text + metadata + …
Can you use those to improve system performance???

Knowledge

Score each topic, level of interest

Topic 1

Topic 2

CI-n … CI-2 CI-1 CI-now

[Equation garbled in extraction: TopicScore for topic Z is defined piecewise (if / else if / else) from a probability P over Z and Day_today, compared against the mean and STD of the same probability over the previous days n−i, with parameters a and b.]

Hot topic Diminishing topic Regular topic

CurrentInterestHistorical Interest

“Obama”, Nov 5th 2008 After Election

[Figure: interest score over 30 days; x-axis 1 to 30, y-axis 0 to 6]

Nov 5th CIV:

Wiki:Barack_Obama; Wiki:Election; win; success; Wiki:President_of_the_United_States

Wiki:African_American; President; World; America; victory; record; first; president; 44th; History; Wiki:Victory_Records

Entity:first_black_president;

Entity:first_black_president; Celebrate; black; african;

Wiki:Colin_Powell; Wiki:Secretary_of_State; Wiki:United_States

Wiki:Sarah_Palin; sarah; palin; hillary; Secret; Wiki:Hillary_Rodham_Clinton

Clinton; newsweek; club; cloth

1. Win
2. Create history
3. First black president

Google web        NDCG3         NDCG5         NDCG10        t-test
CIV               0.35909366    0.399970894   0.479302401
CILM              0.356652652   0.387120299   0.483420045
Google            0.230423817   0.318737414   0.388792379   **
TFIDF             0.27596245    0.333012091   0.437831859   *
BM25              0.284599431   0.336961764   0.436466778   *
LM (linear)       0.32558799    0.382113457   0.473992963
LM (dirichlet)    0.34665084    0.358128576   0.45150825
LM (twostage)     0.349735965   0.358725227   0.450046444
BEST1:            CIV           CIV           CILM
BEST2:            CILM          CILM          CIV
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Yahoo_web         NDCG3         NDCG5         NDCG10        t-test
CIV               0.351765133   0.38207777    0.475506721
CILM              0.391807685   0.40623334    0.482464858
Yahoo             0.288059321   0.326373542   0.410969176
TFIDF             0.24320988    0.282799657   0.404092457   ***
BM25              0.245263974   0.277579262   0.395953269   ***
LM (linear)       0.276208943   0.316889107   0.432428784   *
LM (dirichlet)    0.223253393   0.270017519   0.385936078   ***
LM (twostage)     0.219225991   0.266537146   0.384349848   ***
BEST1:            CILM          CILM          CILM
BEST2:            CIV           CIV           CIV
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Knowledge Retrieval System

Knowledge-based Information Need

Knowledge within Scientific Literature

Matching

Query Knowledge Representation

How to help user propose

knowledge-base queries ?

How to represent

knowledge?

How to match

between the two?

Academic Knowledge


Query Recommendation & Feedback

Query Recommendation

Query Feedback


Structural Keyword Generation - Features

Category / Feature: Description or Example

Keyword Content (text content of the keyword, stemmed, case insensitive, stop words removed)
• Content_Of_Keyword: a vector of all the tokens in the keyword
• CAP: whether the keyword is capitalized
• Contain_Digit: whether the keyword contains digits, e.g., TREC2002, value = true
• Character_Length_Of_Keyword: number of characters in the target keyword
• Token_Length_Of_Keyword: number of tokens in the keyword
• Category_Length_Of_Keyword: number of tokens in the keyword; if the length is more than four, we use four to represent its category length

Title Context
• Exist_In_Title: whether the keyword exists in the title (stemmed, case insensitive, stop words removed)
• Location_In_Title: the position where the keyword appears in the title
• Title_Text_POS: unigram and its part of speech in title (in a text window)
• Title_Unigram: unigram of keyword in title (in a text window)
• Title_Bigram: bigram of keyword in title (in a text window)

Abstract Context
• Location_In_Abstract: which sentence of the abstract the keyword appears in
• Keyword_Position_In_Sentence_Of_Abstract: the keyword’s position in the sentence (beginning, middle or end)
• Abstract_Freq: how many times a keyword appears in the abstract
• Abstract_Text_POS: unigram and its part of speech in abstract (in a text window)
• Abstract_Unigram: unigram of keyword in abstract (in a text window)
• Abstract_Bigram: bigram of keyword in abstract (in a text window)

Evaluation – Domain Knowledge Generation

F1 Compare                                   Concept             Supervised   Semi-supervised
Keyword-based features                       Research Question   0.637        0.662
                                             Methodology         0.479        0.516
                                             Dataset             0.824        0.816
                                             Evaluation          0.571        0.571
Keyword + Title-based features
Keyword + Title + Abstract-based features

F measure comparison for Supervised Learning and Semi-Supervised Learning

GOOD! but not PERFECT…

Knowledge comes from…

• System? Machine Learning, but… low to modest performance…

• User? No way! Very high cost! Author won’t contribute…

• System + User? Possible!

WikiBackyard

ScholarWiki

EditTrigger: 1. Wiki page improves; 2. Machine learning model improves; 3. All other wiki pages improve; 4. KR index improves!

User + Machine learning is powerful…YES! It helps!!!

• Knowledge retrieval for scholarly publications…
• Knowledge from paper
• Knowledge from user
– Knowledge feedback
– Knowledge recommendation

• Knowledge from User vs. from Machine learning

• ScholarWiki (user) + WikiBackyard (machine)

Knowledge via Social Network and Text Mining

CITATION? CO-OCCUR? CO-AUTHOR?

Content of each node?
Motivation of each citation?

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Full text citation analysis


Every word @ Citation Context will VOTE!! Motivation? Topic? Reason???
Left and right N words?? N = ??????????


Word effectiveness decays with distance!!!

Closer words make a more significant contribution!!

How about language model? Each node and edge represented by a language model?
High dimensional space! Word difference?

Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)

Supervised topic modeling

1. Each topic has a label (YES! We can interpret each topic)
2. We DO KNOW the total number of topics

Each paper is a mixture distribution over author-given keywords

Keywords

Each paper: p_zkeyi(paper) = p(z_keyi | abstract, title)


Paper importance

if we have 3 topics (keywords): key1, key2, key3

Domain credit: 100 (pub 1: 25, pub 2: 25, pub 3: 25, pub 4: 25)

P(key1 | text) = 0.6
P(key2 | text) = 0.15
P(key3 | text) = 0.25

Key1-Pub1 credit: 25 * 0.6

P(key1 | citation) = 0.8
P(key2 | citation) = 0.1
P(key3 | citation) = 0.1

Key1-Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]

[Figure: two citation edges weighted 0.8 and 0.2]

Evenly share the credits?

A citation is important if: 1. the citation focuses on an important topic; 2. the other citations focus on other topics
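The credit arithmetic above can be checked with a few lines. This is a sketch under the assumption that 0.8 and 0.2 are the key1 weights of two outgoing citations of pub 1 and that topic credit is shared in proportion to those weights; the function and variable names are hypothetical.

```python
def topic_credit(pub_credit, p_topic_given_text):
    # Share of a publication's credit that belongs to one topic (keyword).
    return pub_credit * p_topic_given_text

def citation_credit(credit, edge_weights, target):
    # Split topic credit across citations in proportion to their topic weights.
    return credit * edge_weights[target] / sum(edge_weights.values())

domain_credit = 100
pub_credit = domain_credit / 4                      # four publications, 25 each
key1_pub1 = topic_credit(pub_credit, 0.6)           # 25 * 0.6

edges = {"citation 1": 0.8, "citation 2": 0.2}      # assumed key1 weights per citation
key1_cit1 = citation_credit(key1_pub1, edges, "citation 1")
key1_cit2 = citation_credit(key1_pub1, edges, "citation 2")
print(key1_pub1, key1_cit1, key1_cit2)
```

Sharing by topic weight, rather than evenly, is what lets a citation that concentrates on an important topic carry more of the credit.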

Paper importance

if we have 3 keywords: key1, key2, key3

Domain credit: 100 (pub 1: 25, pub 2: 25, pub 3: 25, pub 4: 25)

Key1-Pub1 credit: 25 * 0.6

Key1-Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]

[Figure: citation edges weighted 0.8 and 0.2; per-keyword credit vectors [25,25,25], [29,26,28], [27,27,26], [25,25,25]]

Domain publication ranking
Domain keyword topical ranking
Topical citation tree

Citation number between paper pair is IMPORTANT!

Different citations make different contributions to the different topics (keywords) of the citing publication.

Publication/venue/author topic prior

Citation transitioning topic prior

[Figure: NDCG for Review citation recommendation; x-axis from nDCG@10 to nDCG@ALL, y-axis NDCG from 0 to 0.4]

Literature Review Citation recommendation

Input: Paper Abstract

Output: A list of ranked citations

MAP and NDCG evaluation

Given a paper abstract:

1. Word level match (language model)
2. Topic level match (KL-Divergence)
3. Topic importance

Use Inference Network to integrate each hypothesis

Citation Recommendation

Content Match

Publication Topical Prior

1. PageRank
2. Full-text PageRank (greedy match)
3. Full-text PageRank (topic modeling)

Topic match

Inference Network

Input

Output:

1. [3] YES 3
2. [2] YES 2
3. [6] NO 0
4. [8] NO 0
5. [10] YES 1
6. [1] NO 0
……

MAP(Cite or not?)

NDCG(Important citation?)
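A sketch of both measures on the six-item ranked output above (YES/NO for cited or not, 3/2/1/0 for importance). It assumes the common log2 discount for DCG and, for simplicity, that every relevant citation appears somewhere in the ranked list, so the ideal ranking and the average-precision denominator are both taken from the list itself.

```python
import math

def dcg(gains):
    # Discounted cumulative gain with a log2 rank discount.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    # Mean of precision@rank taken at each rank where a relevant item appears.
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

gains = [3, 2, 0, 0, 1, 0]          # importance of each ranked citation
cited = [g > 0 for g in gains]      # YES / NO

ap = average_precision(cited)
score = ndcg(gains)
print(round(ap, 3), round(score, 3))
```

MAP is the mean of this average precision over all test papers; NDCG additionally rewards putting the most important citations first.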

[Figure: NDCG for citation recommendation based on Abstract; x-axis nDCG@10 … nDCG@5000 …, y-axis NDCG up to 0.4]

Based on greedy match, 1 second

Based on topic inference, 30 seconds

CONCLUSION

• Information Retrieval
• Index
• Retrieval Model
• Ranking
• User feedback
• Evaluation

• Knowledge Retrieval
• Machine Learning
• User Knowledge
• Integration
• Social Network Analysis

Thank you!