Information Retrieval to Knowledge Retrieval, one more step Xiaozhong Liu Assistant Professor School of Library and Information Science Indiana University Bloomington
Feb 25, 2016
What is Information?
What is Retrieval?
What is Information Retrieval?
I am Retriever
How to find this book in Library?
Search something based on User Information Need!!
How to express your information need?
Query
User Information Need!!
Query
What is a good query? What is a bad query?
Good query: query ≈ information need
Bad query: query ≠ information need
Wait!!! The user NEVER makes a mistake!!! It’s OUR job!!!
Task 1: Given a user information need, how do we help the user propose a better query (or propose one automatically)?
If there is a query…
Perfect query: $Query_{optimize}$
User input query: $Query_{user}$
User Information Need!!
Query Results
Given a query, how to retrieve results?
What are good results? What are bad results?
Task 2: Given a query (not perfect), how to retrieve documents from the collection?
Very large, unstructured text data!!!
F(query, doc)
Can you give me an example?
F(query, doc)
If a query term exists in doc: Yes, this is a result
If a query term does NOT exist in doc: No, this is not a result
Is there any problem with this function? Brainstorm…
Query: Obama’s wife
Doc 1. My wife supports Obama’s new policy on…
Doc 2. Michelle, as the first lady of the United States…
Yes, this is a very challenging task!
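The term-matching function F(query, doc) and its failure on the “Obama’s wife” example can be sketched as follows. This is a minimal illustration, not the lecture's code: the `tokenize` and `naive_match` names and the regular-expression tokenizer are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_match(query, doc):
    """Return True if at least one query term appears in the document."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

query = "Obama's wife"
doc1 = "My wife supports Obama's new policy on..."
doc2 = "Michelle, as the first lady of the United States..."

# Doc 1 matches on surface terms although it is about someone else's wife;
# Doc 2 is the relevant one but shares no term with the query.
print(naive_match(query, doc1))
print(naive_match(query, doc2))
```

Exact term matching rewards shared words, not shared meaning, which is the problem the two documents are meant to expose.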
Another problem
Collection size: 5 billion
Matching docs: 5
My algorithm successfully finds all 5 docs! In… 3 billion results…
User Information Need!!
Query Results
How to help the user find the right results among all the retrieved results?
Task 3: Given the retrieved results, how to help users find their results?
If retrieval algorithm retrieved 1 billion results from collection, what will you do???
Search with Google, click “next”???
Yes, we can help user find what they need!
Query: Indiana University Bloomington
Can you read it One by one?
You use it??
[Diagram: User Information Need → Query → Results; steps 1, 2, and 3 span the User and the System]
They are not independent!
Information Retrieval
Text
Image
Music
Map
……
document
web
scholar
blog
news
Index
Documents vs. Database Records
• Relational database records are typically made up of well-defined fields
Select * from students where GPA > 2.5
We need a more effective way to index the text!
Text, similar way? Find all the docs including “Xiaozhong”
Select * from documents where text like '%xiaozhong%'
Collection C: doc1, doc2, doc3 ……… docN
Query q: q1, q2, q3 ……… qt where qx is the query term
Document doci: di1, di2, di3 ……… dim, where all dij ∈ V
Vocabulary V: w1, w2, w3 ……… wn
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
        w1    w2    w3    …    wn
Doc1    1     0     0     …    1
Doc2    0     0     0     …    1
Doc3    1     1     1     …    1
………
DocN    1     0     1     …    1
Query q: 0, 1, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
        w1    w2    w3    …    wn
Doc1    3     0     0     …    9
Doc2    0     0     0     …    7
Doc3    2     11    21    …    1
………
DocN    7     0     1     …    2
Query q: 0, 3, 0 ………
Normalization is very important!
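The binary and count matrices above, and the effect of normalization, can be sketched in a few lines. The three toy documents and the helper names `build_matrix` and `normalize` are assumptions for illustration; normalization here simply divides each count by the document length.

```python
from collections import Counter

def build_matrix(docs):
    """Return (vocabulary, rows); rows[name] is a raw term-count vector."""
    vocab = sorted({w for text in docs.values() for w in text.split()})
    rows = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        rows[name] = [counts[w] for w in vocab]
    return vocab, rows

def normalize(row):
    """Divide each count by the document length so long docs do not dominate."""
    total = sum(row)
    return [c / total for c in row] if total else row

# Hypothetical toy collection (not from the slides' matrices).
docs = {"Doc1": "cat dog cat", "Doc2": "cat dog", "Doc3": "snake"}

vocab, counts = build_matrix(docs)
binary = {name: [1 if c else 0 for c in row] for name, row in counts.items()}
weights = {name: normalize(row) for name, row in counts.items()}

print(vocab)
for name in docs:
    print(name, binary[name], counts[name], weights[name])
```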
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
        w1      w2      w3      …    wn
Doc1    0.41    0       0       …    0.62
Doc2    0       0       0       …    0.12
Doc3    0.42    0.11    0.34    …    0.13
………
DocN    0.01    0       0.19    …    0.24
Query q: 0, 0.37, 0 ………
Normalization is very important!
Weight
Term weighting
TF * IDF
Term frequency: freq(w, doc) / |doc|
Or…
Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: total number of docs containing word w
An effective way to weight each word in a document
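As a worked illustration of the weighting above, the sketch below computes TF = freq(w, doc) / |doc| and IDF = 1 + log(N/k) for a toy collection. The slide does not state the logarithm base, so the natural log is an assumption, as are the function name `tf_idf` and the sample documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF  = freq(w, doc) / |doc|
    IDF = 1 + log(N / k), natural log assumed
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * (1 + math.log(n_docs / doc_freq[w]))
            for w, c in counts.items()
        })
    return weights

# Hypothetical toy collection.
collection = ["cat dog cat", "cat dog", "snake"]
for doc, w in zip(collection, tf_idf(collection)):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

A term that occurs in only one document ("snake") receives a higher IDF than one that occurs in most documents ("cat"), which is the intended effect of the IDF factor.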
Index
Space?
Speed?
Retrieval Model?
Ranking?
Semantic?
Document representation meets the requirement of retrieval system
Stemming
Education, Educational, Educate, Educating, Educations → Educat
Very effective to improve system performance.
Some risk! E.g. LA Lakers = LA Lake?
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
Inverted index
Words: I, love, my, cat, this, is, lovely, yellow, and, white
After stop-word removal and stemming: i, love, cat, thi, yellow, white
i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3
We lose something?
Inverted index with term frequency (doc:frequency)
i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1
We still lose something?
Inverted index with position (doc:position)
i – 1:1
love – 1:2, 2:4
thi – 2:1
cat – 1:4, 2:2, 3:2, 3:5
yellow – 3:1
white – 3:4
Why do you need position info?
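The three index variants above (document lists, frequencies, positions) can all be derived from one positional index. The sketch below builds it for the three cat documents. It skips stemming, so "lovely" and "this" stay unstemmed, and the stop-word list is an assumption chosen to match the slide.

```python
import re
from collections import defaultdict

STOPWORDS = {"my", "is", "and"}  # assumed stop-word list for this example

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for position, token in enumerate(tokens, start=1):
            if token in STOPWORDS:
                continue  # stop words still consume a position
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {
    1: "I love my cat.",
    2: "This cat is lovely!",
    3: "Yellow cat and white cat.",
}
index = build_positional_index(docs)

for term, postings in index.items():
    # Term frequency per document is the length of the position list.
    print(term, postings, {d: len(p) for d, p in postings.items()})
```

Keeping positions costs more space than keeping only document ids, but it allows phrase and proximity matching at query time.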
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.
Proximity of query terms
query: information retrieval
Index – bag of words
query: information retrieval
What’s the limitation of bag-of-words? Can we make it better?
n-gram:
Doc 1: information retrieval, retrieval is, is important, important for ……
bi-gram
Better semantic representation! What’s the limitation?
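A bi-gram representation of Doc 1 can be generated as below. This is a minimal sketch: the `ngrams` helper is a hypothetical name, and tokenization is a plain whitespace split after dropping the final period.

```python
def ngrams(text, n=2):
    """Return the list of word n-grams in the text (bi-grams by default)."""
    tokens = text.lower().rstrip(".").split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "information retrieval is important for digital library."
bigrams = ngrams(doc1)
print(bigrams)

# A bi-gram index can tell the phrase "information retrieval" apart from
# a document that merely contains both words far from each other.
print("information retrieval" in bigrams)
```

One limitation is visible in the output itself: most bi-grams ("retrieval is", "is important") are not meaningful phrases, and the vocabulary grows much faster than with single words.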
Doc 1: …… big apple ……
Doc 2: …… apple……
Index – bag of “phrase”?
More precise, less ambiguous
How to identify phrases from documents?
• Identify syntactic phrases using POS tagging
• n-grams
• From existing resources
Noise detection
What is the noise of web page? Non-informative content…
Web Crawler - freshness
Web is changing, but we cannot constantly check all the pages…
Need to find the most important pages that change frequently
www.nba.com
www.iub.edu
www.restaurant????.com
Sitemap: a list of urls for each host; modification time and freq
Retrieval
Model
Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world (Hiemstra, 2001).
e.g. some models help you predict tomorrow’s stock price…
Hypothesis:
Retrieval and ranking problem = Similarity Problem!
Vector Space Model
Is that a good hypothesis? Why?
Retrieval Function: Similarity (query, Document)
Return a score!!! We can Rank the documents!!!
So, query is a short document
Vector Space Model
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
        w1      w2      w3      …    wn
Doc1    0.41    0       0       …    0.62
Doc2    0       0       0       …    0.12
Doc3    0.42    0.11    0.34    …    0.13
………
DocN    0.01    0       0.19    …    0.24
Query q: 0, 0.37, 0 ………
Similarity
Doc Vector
Query Vector
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat cat
[Figure: doc 1, doc 2, and doc 3 plotted as vectors; axis label: dog; ticks: 1, 2]
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat
F(q, doc) = cosine similarity(q, doc)
[Figure: doc 1, doc 2 (= query), and doc 3 plotted as vectors on the cat and dog axes (ticks: 1, 2), with angle θ between vectors]
Why Cosine?
Vector Space Model
Dimension = n = vocabulary size
Query q: q1, q2, q3 ……… qn
Same dimensional space!!!
Document doci: di1, di2, di3 ……… din, where all dij ∈ V
Vocabulary V: w1, w2, w3 ……… wn
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat
Try!
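One way to work the exercise is to treat each document and the query as raw term-count vectors and rank by cosine similarity. The sketch below assumes the elided text ("……") contributes no terms and uses raw counts as weights; both are simplifications made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = {
    "doc 1": Counter("cat dog cat".split()),
    "doc 2": Counter("cat dog".split()),
    "doc 3": Counter("snake".split()),
}
query = Counter("dog cat".split())

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 4))
```

Cosine depends only on the angle between vectors, not their length, so doc 2 (same direction as the query) ranks above doc 1 even though doc 1 contains more matching words.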
Term weighting
Doc [ 0.42 0.11 0.34 0.13 ]
weight, how?
TF * IDF
Term frequency: freq(w, doc) / |doc|
Or…
Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: total number of docs containing word w
More TF
Weighting is very important for retrieval model!We can improve TF by…
e.g. freq(term, doc) → log[freq(term, doc)]
BM25:
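The BM25 formula itself did not survive on this slide, so the sketch below uses a commonly cited Okapi BM25 form, which may differ from the variant shown in the lecture. The parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant.

    Assumes a non-empty collection. IDF uses log((N - n + 0.5) / (n + 0.5) + 1),
    which stays non-negative even for very frequent terms.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            f = counts[term]
            if f == 0:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

collection = ["cat dog cat", "cat dog", "snake"]  # hypothetical toy docs
scores = bm25_scores("dog cat", collection)
print([round(s, 3) for s in scores])
```

Compared with raw TF, the term-frequency part saturates as f grows (controlled by k1) and is discounted for long documents (controlled by b).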
Vector Space Model
But…
Bag of word assumption = Word independent!
Query = Document, maybe not true!
Vector and SEO (Search Engine Optimization)…
synonym? Semantic related words?
How about these…
Pivoted Normalization Method
Dirichlet Prior Method
TF, IDF, Normalization
+ parameter
Language model
Probability distribution over words
P(I love you) = 0.01
P(you love I) = 0.00001
P(love you I) = 0.0000001
If we have this information… we could build a generative model!
P(text | )
Language model - unigram
Generate text with bag-of-word assumption (word independent):
P (w1, w2,…wn) = P(w1) P(w2)…P(wn)
food orange desk USB computer Apple Unix …. …. …. milk sport superbowl
topic X = ???
food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB
topic 1, topic 2
Doc: I’m using Mac computer… remote access another computer… share some USB device…
P(Doc | topic1) vs. P(Doc | topic2)
king ghost hamlet play …. …. romeo juliet
iPad iphone 4s tv apple …… play store
food orange desk USB computer Apple Unix …. …. …. milk sport superbowl
topicX
How to estimate???
If we have enough data, i.e. docs about topic X
10/10000 1000/10000 30/10000
P(“computer” | topic X)
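Estimating a unigram topic model by relative frequency, then comparing P(Doc | topic 1) with P(Doc | topic 2), can be sketched as follows. The training texts are invented for the example, and the small `floor` probability for unseen words is a crude stand-in for proper smoothing.

```python
import math
from collections import Counter

def estimate_unigram(texts):
    """Estimate P(w | topic) as count(w) / total words in the topic's texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_likelihood(doc, model, floor=1e-9):
    """log P(doc | model) under word independence: sum of log P(w | model).

    `floor` is an assumed tiny probability for words the model never saw.
    """
    return sum(math.log(model.get(w, floor)) for w in doc.lower().split())

# Hypothetical training texts for two topics.
topic1 = estimate_unigram(["computer usb unix computer apple"])
topic2 = estimate_unigram(["milk yogurt food orange sport"])

doc = "computer usb computer"
ll1 = log_likelihood(doc, topic1)
ll2 = log_likelihood(doc, topic2)
print(topic1["computer"])          # relative frequency, like 1000/10000 on the slide
print("topic 1" if ll1 > ll2 else "topic 2")
```

Logs are summed instead of multiplying raw probabilities because the product of many small numbers underflows quickly.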
food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB
doc 1, doc 2
query: sport game watch
P(query | doc 1) vs. P(query | doc 2)
a document doc:
query likelihood query term likelihood
Retrieval Problem → Query likelihood → Term likelihood P(qi | doc)
But document is a small sample of topic… Data like:
Smoothing!
P(qi | doc) What if qi is not observed in doc? P(qi | doc) = 0?
We want to give this a non-zero score!!!
Smoothing
i.e.
We can make it better!
Smoothing
First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).
Second, smoothing helps to model the background (non-discriminative) words in the query.
Improve language model estimation by using Smoothing
Smoothing
Another smoothing method:
P (w | )
if the word exists in doc
if the word does not exist in doc
P (w | doc)
P (w | collection) Collection Language Model
P (w | ) = (1-λ) ∙ P(w | θdoc) + λ ∙ P(w | θcollection)
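Query likelihood with this kind of linear interpolation between the document model and the collection model can be sketched as below. The value `lam=0.5`, the function name, and the two toy documents (built from words on the earlier slides) are assumptions for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) with linear interpolation smoothing.

    Each term uses (1 - lam) * P(w | doc) + lam * P(w | collection),
    both estimated by relative frequency.
    """
    doc_tokens = doc.lower().split()
    coll_tokens = [w for d in collection for w in d.lower().split()]
    doc_counts, coll_counts = Counter(doc_tokens), Counter(coll_tokens)
    score = 0.0
    for term in query.lower().split():
        p_doc = doc_counts[term] / len(doc_tokens)
        p_coll = coll_counts[term] / len(coll_tokens)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score

doc1 = "food orange milk yogurt"
doc2 = "nba sport superbowl nhl score"
collection = [doc1, doc2]
query = "sport score milk"

s1 = query_log_likelihood(query, doc1, collection)
s2 = query_log_likelihood(query, doc2, collection)
print(round(s1, 3), round(s2, 3))

# Without smoothing (lam=0) a single unseen query term zeroes the whole score.
print(query_log_likelihood(query, doc2, collection, lam=0.0))
```

With smoothing, doc 2 still receives a finite score although it never mentions "milk", and it outranks doc 1 because it matches two of the three query terms.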
Smoothing
We could use collection language model:
TFIDF is closely related to Language Model and other retrieval models
Term Freq
IDF
Doc length norm
Language model
Solid statistical foundation
Flexible parameter setting
Different smoothing method
Language model in library?
If we have a paper… and a query…
Similarity (paper, query) Vector Space Model
If query word not in the paper…
Score = 0
If we use language model…
Language model in library?
Likelihood of query given a paper can be estimated by:
P(query | ) = αP (query | paper) + βP (query | author) +γP (query | journal) +……
Likelihood of query given a paper & author & journal & ……
e.g. what’s the difference between web and doc retrieval???
F (doc, query)
F (web page, query)
vs
web page = doc + hyperlink + domain info + anchor text + metadata + …
Can you use those to improve system performance???
Knowledge
Score each topic, level of interest
Topic 1
Topic 2
CI-n … CI-2 CI-1 CI-now
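[Equation: TopicScore(Z_n), a piecewise function. Two conditional branches compare P(Day_today | Z_n), mean[P(Day_i | Z_n)], and STD[P(Day_i | Z_n)] and return expressions involving the parameters a and b and the ratio P(Day_today | Z_n) / mean[P(Day_i | Z_n)]; the final else branch returns P(Day_today | Z_n).]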
Hot topic Diminishing topic Regular topic
Current Interest
Historical Interest
[Figure: “Obama”, Nov 5th 2008, after the election; x-axis ticks 1 to 30, y-axis ticks 0 to 6]
Nov 5th CIV:
Wiki:Barack_Obama; Wiki:Election; win; success; Wiki:President_of_the_United_States
Wiki:African_American; President; World; America; victory; record; first; president; 44th; History; Wiki:Victory_Records
Entity:first_black_president; Celebrate; black; african;
Wiki:Colin_Powell; Wiki:Secretary_of_State; Wiki:United_States
Wiki:Sarah_Palin; sarah; palin; hillary; Secret; Wiki:Hillary_Rodham_Clinton
Clinton; newsweek; club; cloth
1. Win
2. Create history
3. First black president
Google web       NDCG3         NDCG5         NDCG10        t-test
CIV              0.35909366    0.399970894   0.479302401
CILM             0.356652652   0.387120299   0.483420045
Google           0.230423817   0.318737414   0.388792379   **
TFIDF            0.27596245    0.333012091   0.437831859   *
BM25             0.284599431   0.336961764   0.436466778   *
LM (linear)      0.32558799    0.382113457   0.473992963
LM (dirichlet)   0.34665084    0.358128576   0.45150825
LM (twostage)    0.349735965   0.358725227   0.450046444
BEST1:           CIV           CIV           CILM
BEST2:           CILM          CILM          CIV
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
Yahoo_web        NDCG3         NDCG5         NDCG10        t-test
CIV              0.351765133   0.38207777    0.475506721
CILM             0.391807685   0.40623334    0.482464858
Yahoo            0.288059321   0.326373542   0.410969176
TFIDF            0.24320988    0.282799657   0.404092457   ***
BM25             0.245263974   0.277579262   0.395953269   ***
LM (linear)      0.276208943   0.316889107   0.432428784   *
LM (dirichlet)   0.223253393   0.270017519   0.385936078   ***
LM (twostage)    0.219225991   0.266537146   0.384349848   ***
BEST1:           CILM          CILM          CILM
BEST2:           CIV           CIV           CIV
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
Knowledge Retrieval System
Knowledge-based Information Need
Knowledge within Scientific Literature
Matching
Query Knowledge Representation
How to help the user propose knowledge-based queries?
How to represent knowledge?
How to match between the two?
Academic Knowledge
Query Recommendation & Feedback
Query Recommendation
Query Feedback
Structural Keyword Generation - Features
(Category; Feature: Description or Example)
Keyword Content
Text content of the keyword, stemmed, case insensitive, stop words removed
• Content_Of_Keyword: a vector of all the tokens in the keyword
• CAP: whether the keyword is capitalized
• Contain_Digit: whether the keyword contains digits, e.g., TREC2002, value = true
• Character_Length_Of_Keyword: number of characters in the target keyword
• Token_Length_Of_Keyword: number of tokens in the keyword
• Category_Length_Of_Keyword: number of tokens in the keyword; if the length is more than four, we use four to represent its category length
Title Context
• Exist_In_Title: whether keyword exists in title (stemmed, case insensitive, stop words removed)
• Location_In_Title: the position where the keyword appears in the title
• Title_Text_POS: unigram and its part of speech in title (in a text window)
• Title_Unigram: unigram of keyword in title (in a text window)
• Title_Bigram: bigram of keyword in title (in a text window)
Abstract Context
• Location_In_Abstract: which sentence of the abstract the keyword appears in
• Keyword_Position_In_Sentence_Of_Abstract: the keyword’s position in the sentence (beginning, middle or end)
• Abstract_Freq: how many times a keyword appears in the abstract
• Abstract_Text_POS: unigram and its part of speech in abstract (in a text window)
• Abstract_Unigram: unigram of keyword in abstract (in a text window)
• Abstract_Bigram: bigram of keyword in abstract (in a text window)
Evaluation – Domain Knowledge Generation
F1 Compare                                  Concept             Supervised   Semi-supervised
Keyword-based features                      Research Question   0.637        0.662
                                            Methodology         0.479        0.516
                                            Dataset             0.824        0.816
                                            Evaluation          0.571        0.571
Keyword + Title-based features              Research Question   0.633        0.667
                                            Methodology         0.498        0.534
                                            Dataset             0.824        0.816
                                            Evaluation          0.571        0.571
Keyword + Title + Abstract-based features   Research Question   0.642        0.663
                                            Methodology         0.420        0.542
                                            Dataset             0.831        0.823
                                            Evaluation          0.621        0.662
F measure comparison for Supervised Learning and Semi-Supervised Learning
GOOD! but not PERFECT…
Knowledge comes from…
• System? Machine Learning, but… low to modest performance…
• User? No way! Very high cost! Author won’t contribute…
• System + User? Possible!
WikiBackyard
ScholarWiki
EditTrigger: 1. Wiki page improve; 2. Machine learning model improve; 3. All other wiki pages improve; 4. KR index improve!
User + Machine learning is powerful… YES! It helps!!!
• Knowledge retrieval for scholar publications…
• Knowledge from paper
• Knowledge from user
– Knowledge feedback
– Knowledge recommendation
• Knowledge from User vs. from Machine learning
• ScholarWiki (user) + WikiBackyard (machine)
Knowledge via Social Network and Text Mining
CITATION? CO-OCCUR? CO-AUTHOR?
Content of each node? Motivation of each citation?
With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.
Full text citation analysis
Every word @ Citation Context will VOTE!! Motivation? Topic? Reason???
Left and Right N words?? N = ??????????
Word effectiveness is decaying based on the distance!!!
Closer words make more significant contribution!!
How about language model? Each node and edge represented by a language model?
High dimensional space! Word difference?
Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)
Supervised topic modeling
1. Each topic has a label (YES! We can interpret each topic)
2. We DO KNOW the total number of topics
Each paper is a mixture probability distribution over author-given keywords
Keywords
Each paper: $p_{z_{key_i}}(paper) = p(z_{key_i} \mid abstract, title)$
Paper importance
if we have 3 topics (keywords): key1, key2, key3
Domain credit: 100
pub 1: 25, pub 2: 25, pub 3: 25, pub 4: 25
P(key1 | text) = 0.6, P(key2 | text) = 0.15, P(key3 | text) = 0.25
Key1-Pub1 credit: 25 * 0.6
P(key1 | citation) = 0.8, P(key2 | citation) = 0.1, P(key3 | citation) = 0.1
Key1-Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]
Citation weights on key1: 0.8, 0.2
Evenly share the credits?
Citation is important if 1. citation focusing on important topic 2. other citations focusing on other topics
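The credit arithmetic on this slide can be traced step by step. The sketch below is one reading of the example: the domain credit is split evenly over four publications, each publication's share is split over keywords by P(key | text), and a keyword's credit is passed to citations in proportion to their weight on that keyword (0.8 for citation 1, 0.2 for the other). The function names are hypothetical.

```python
def topic_credit(pub_credit, p_topic_given_text):
    """Share of a publication's credit that belongs to one keyword."""
    return pub_credit * p_topic_given_text

def citation_credit(pub_topic_credit, citation_weight, all_weights):
    """Pass a keyword's credit to one citation, proportional to its weight."""
    return pub_topic_credit * citation_weight / sum(all_weights)

domain_credit = 100
pubs = ["pub 1", "pub 2", "pub 3", "pub 4"]
pub_credit = domain_credit / len(pubs)            # even split: 25 each

p_key_given_text = {"key1": 0.6, "key2": 0.15, "key3": 0.25}
key1_pub1 = topic_credit(pub_credit, p_key_given_text["key1"])   # 25 * 0.6

# Weights of the two citations on key1, as in 0.8 / (0.8 + 0.2).
key1_weights = [0.8, 0.2]
key1_citation1 = citation_credit(key1_pub1, key1_weights[0], key1_weights)

print(pub_credit, key1_pub1, key1_citation1)
```

Under this reading the credit is not shared evenly: a citation that concentrates on an important keyword receives most of that keyword's credit.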
[25,25,25]
[29,26,28] [27,27,26]
[25,25,25]
Domain publication ranking
Domain keyword topical ranking
Topical citation tree
Citation number between paper pair is IMPORTANT!
Different citations make different contributions, on different topics (keywords), to the citing publication.
Publication/venue/author topic prior
Citation transitioning topic prior
[Figure: NDCG for review citation recommendation; x-axis: nDCG@10, nDCG@30, nDCG@50, nDCG@100, nDCG@300, nDCG@500, nDCG@1000, nDCG@3000, nDCG@5000, nDCG@ALL; y-axis: NDCG, 0 to 0.4]
Literature Review Citation recommendation
Input: Paper Abstract
Output: A list of ranked citations
MAP and NDCG evaluation
Given a paper abstract:
1. Word level match (language model)
2. Topic level match (KL-Divergence)
3. Topic importance
Use Inference Network to integrate each hypothesis
Citation Recommendation
Content Match
Publication Topical Prior
1. PageRank
2. Full-text PageRank (greedy match)
3. Full-text PageRank (topic modeling)
Topic match
Inference Network
Input
Output:
1. [3] YES 3
2. [2] YES 2
3. [6] NO 0
4. [8] NO 0
5. [10] YES 1
6. [1] NO 0
……
MAP(Cite or not?)
NDCG(Important citation?)
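For the six ranked citations listed above, the two measures can be computed as follows. This is a sketch under stated assumptions: average precision is taken over the relevant items that appear in the list, and DCG uses a linear gain with a log2(rank + 1) discount, which is one common definition and may differ from the one used in the evaluation.

```python
import math

def average_precision(relevance):
    """AP for one ranked list of booleans (cited or not)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each relevant rank
    return total / hits if hits else 0.0

def dcg(gains):
    """Discounted cumulative gain: gain / log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# The ranked output above: YES/NO labels and importance scores.
cited = [True, True, False, False, True, False]
importance = [3, 2, 0, 0, 1, 0]

ap = average_precision(cited)
score = ndcg(importance)
print(round(ap, 4), round(score, 4))
```

MAP is the mean of this average precision over all test queries (here, all input abstracts); it only asks whether a citation is correct, while NDCG also rewards placing the more important citations first.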
[Figure: NDCG for citation recommendation based on Abstract; x-axis: nDCG@10, nDCG@30, nDCG@50, nDCG@100, nDCG@300, nDCG@500, nDCG@1000, nDCG@3000, nDCG@5000, nDCG@ALL; y-axis: NDCG, 0.1 to 0.4]
Based on greedy match, 1 second
Based on topic inference, 30 seconds
CONCLUSION
• Information Retrieval
• Index
• Retrieval Model
• Ranking
• User feedback
• Evaluation
• Knowledge Retrieval
• Machine Learning
• User Knowledge
• Integration
• Social Network Analysis
Thank you!