INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE Alex Lin Senior Architect Intelligent Mining alin at IntelligentMinining.com
Jan 15, 2015
Overview of Retrieval Models
- Boolean Retrieval
- Vector Space Model
- Probabilistic Model
- Language Model
Boolean Retrieval
lincoln AND NOT (car AND automobile)
- The earliest retrieval model, and still in use today
- Results are very easy to explain to users
- Highly efficient computationally
- The major drawback: it lacks a sophisticated ranking algorithm
Vector Space Model

[Figure: documents Doc1, Doc2 and the Query represented as vectors in term space (axes Term2, Term3)]

Cos(D_i, Q) = Σ_{j=1..t} d_ij · q_j / sqrt( Σ_{j=1..t} d_ij² · Σ_{j=1..t} q_j² )
Major flaw: the model gives no guidance on how the weighting and ranking algorithms relate to relevance.
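The cosine formula above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; it assumes documents and queries are given as sparse term-to-weight dictionaries, and the function name is invented for the example.

```python
import math

def cosine_score(doc, query):
    """Cosine of the angle between a document vector and a query vector,
    both given as sparse {term: weight} dicts (an assumed representation)."""
    # Numerator: dot product over the terms the document contains
    dot = sum(w * query.get(t, 0.0) for t, w in doc.items())
    # Denominator: product of the two vector norms
    doc_norm = math.sqrt(sum(w * w for w in doc.values()))
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    if doc_norm == 0.0 or q_norm == 0.0:
        return 0.0
    return dot / (doc_norm * q_norm)
```

Identical vectors score 1.0, and vectors with no shared terms score 0.0, which is what makes the cosine usable as a relevance proxy.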
Probabilistic Retrieval Model

[Figure: a document D is assigned to the Relevant class with probability P(R|D) and to the Non-Relevant class with probability P(NR|D)]

Bayes' Rule:  P(R|D) = P(D|R) · P(R) / P(D)
Probabilistic Retrieval Model

By Bayes' Rule:
P(R|D)  = P(D|R) · P(R) / P(D)
P(NR|D) = P(D|NR) · P(NR) / P(D)

If P(R|D) > P(NR|D), classify D as relevant. Since P(D) cancels, this is equivalent to:
P(D|R) · P(R) > P(D|NR) · P(NR)
Estimate P(D|R) and P(D|NR)

Define D = (d_1, d_2, ..., d_t). Then, treating terms as independent:

P(D|R)  = Π_{i=1..t} P(d_i|R)
P(D|NR) = Π_{i=1..t} P(d_i|NR)
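The classification rule with the independence products can be sketched as follows, working in log space to avoid underflow. This is an illustrative sketch only: the function name, the dictionary inputs, and the tiny floor probability for unseen terms are all assumptions, not part of the slides.

```python
import math

def classify_relevant(doc_terms, p_rel, p_nonrel, prior_rel=0.5):
    """Decide P(D|R)P(R) > P(D|NR)P(NR) under term independence.
    p_rel[t] estimates P(t|R), p_nonrel[t] estimates P(t|NR) (assumed inputs)."""
    log_rel = math.log(prior_rel)
    log_nonrel = math.log(1.0 - prior_rel)
    for t in doc_terms:
        # Sum of logs = log of the product Π P(d_i|·); 1e-9 floors unseen terms
        log_rel += math.log(p_rel.get(t, 1e-9))
        log_nonrel += math.log(p_nonrel.get(t, 1e-9))
    return log_rel > log_nonrel
```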
Binary Independence Model = term independence + binary term features in documents (d_i ∈ {0, 1})
Likelihood Ratio

The classification rule becomes a likelihood-ratio test:
P(D|R) / P(D|NR) > P(NR) / P(R)

Under the binary independence assumptions:
P(D|R) / P(D|NR) = Π_{i: d_i=1} (p_i / s_i) · Π_{i: d_i=0} (1−p_i) / (1−s_i)

Taking logs and dropping the document-independent factor gives a rank-equivalent score:
Σ_{i: d_i=1} log [ p_i (1−s_i) / ( s_i (1−p_i) ) ]

Estimating p_i and s_i from relevance counts (with 0.5 smoothing):
= Σ_{i: d_i=q_i=1} log [ (r_i+0.5)/(R−r_i+0.5) / ( (n_i−r_i+0.5)/(N−n_i−R+r_i+0.5) ) ]
p_i: probability that term i occurs in a document from the relevant set
s_i: probability that term i occurs in a document from the non-relevant set
N: total number of documents
n_i: number of documents that contain term i
R: total number of relevant documents
r_i: number of relevant documents that contain term i
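The smoothed term weight above is straightforward to compute from the four counts. A minimal sketch (the function name is an assumption for illustration):

```python
import math

def bim_term_weight(r_i, n_i, R, N):
    """Binary independence model term weight:
    log [ (r_i+0.5)/(R-r_i+0.5) / ((n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)) ]
    The 0.5 smoothing keeps the ratio finite when counts are zero."""
    num = (r_i + 0.5) / (R - r_i + 0.5)
    den = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(num / den)
```

With no relevance information (R = r_i = 0) the weight reduces to roughly log((N − n_i + 0.5)/(n_i + 0.5)), an IDF-like quantity: rare terms get high weight, terms in half the collection get weight near zero.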
Combine with the BM25 Ranking Algorithm

BM25 extends the binary independence model's scoring function to include document and query term weights. It has performed very well in TREC experiments.
R(q,D) = Σ_{i∈Q} log [ (r_i+0.5)/(R−r_i+0.5) / ( (n_i−r_i+0.5)/(N−n_i−R+r_i+0.5) ) ] · (k_1+1) f_i / (K + f_i) · (k_2+1) qf_i / (k_2 + qf_i)

K = k_1 ( (1−b) + b · dl / avgdl )

k_1, k_2, b: tuning parameters
f_i: frequency of term i in the document
qf_i: frequency of term i in the query
dl: document length
avgdl: average document length in the data set
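The BM25 formula can be sketched directly from those definitions. This is an illustrative implementation, not production code; the function names and the default parameter values (common choices such as k_1 = 1.2, b = 0.75) are assumptions.

```python
import math

def bm25_term_score(f_i, qf_i, dl, avgdl, n_i, N, R=0, r_i=0,
                    k1=1.2, k2=100.0, b=0.75):
    """One term's contribution to the BM25 score.
    f_i: term frequency in the document; qf_i: term frequency in the query."""
    # Length-normalization factor K = k1((1-b) + b * dl/avgdl)
    K = k1 * ((1 - b) + b * dl / avgdl)
    # Smoothed relevance/IDF component from the binary independence model
    idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                        ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
    tf_part = (k1 + 1) * f_i / (K + f_i)       # document-term saturation
    qtf_part = (k2 + 1) * qf_i / (k2 + qf_i)   # query-term saturation
    return idf_part * tf_part * qtf_part

def bm25_score(query_tf, doc_tf, dl, avgdl, df, N, **kw):
    """Sum over query terms; df[t] is the document frequency n_i of term t."""
    return sum(bm25_term_score(doc_tf.get(t, 0), qf, dl, avgdl, df[t], N, **kw)
               for t, qf in query_tf.items())
```

Note how the tf_part saturates: doubling f_i far above K barely changes the score, which is what distinguishes BM25 from raw term-frequency weighting.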
Weighted Fields Boolean Search

[Table: an index with columns doc-id, field0, field1, …, text and rows for documents 1..n]

R(q,D) = Σ_{i∈q} Σ_{f∈fields} w_f · m_if

w_f: weight of field f; m_if = 1 if query term i matches field f of D, 0 otherwise
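The double sum over query terms and fields is a simple weighted match count. A minimal sketch, assuming documents are stored as field-to-term-set mappings (the function name and data layout are illustrative, not from the slides):

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """R(q,D) = sum over query terms i and fields f of w_f * m_if,
    where m_if = 1 if term i appears in field f of the document.
    doc_fields: {field_name: set_of_terms} (an assumed layout)."""
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            if term in doc_fields.get(field, ()):
                score += weight  # m_if = 1: this field matches the term
    return score
```

A term matching a heavily weighted field (say, a title) contributes more than the same term matching the body text.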
Apply Probabilistic Knowledge into Fields

[Table: the fielded index (doc-id, field0, field1, …, text); e.g. document 2 contains "Lightyear" and "Buzz" in its fields]

[Figure: a document is pulled toward the Relevant class P(R|D) or the Non-Relevant class P(NR|D); matches in some fields have a higher gradient toward relevance than matches in others]
Use the Knowledge during Ranking

log P(D|R) = log Π_{i=1..t} P(d_i|R) = Σ_{i=1..t} log P(d_i|R) ≈ Σ_{i∈q} Σ_{f∈F} w_f · m_if

[Table: the fielded index as above, with field values such as "Lightyear" and "Buzz" for document 2]

The goal is: learnable field weights w_f.
Comparison of Approaches

TF-IDF:
R_TF-IDF(q,D) = Σ_i tf_ik · idf_i, where tf_ik = f_ik / Σ_{j=1..t} f_jk and idf_i = log( N / n_i )

BM25:
R(q,D) = Σ_{i∈Q} log [ (r_i+0.5)/(R−r_i+0.5) / ( (n_i−r_i+0.5)/(N−n_i−R+r_i+0.5) ) ] · (k_1+1) f_i / (K + f_i) · (k_2+1) qf_i / (k_2 + qf_i)
K = k_1 ( (1−b) + b · dl / avgdl )

BM25 without relevance information (the log term drops out):
R_bm25(q,D) = Σ_{i∈Q} (k_1+1) f_i / (K + f_i) · (k_2+1) qf_i / (k_2 + qf_i)

Weighted fields with probabilistic knowledge:
R(q,D) = Σ_{i∈q} Σ_{f∈F} w_f · m_if · (k_1+1) f_i / (K + f_i) · (k_2+1) qf_i / (k_2 + qf_i)

In each formula, the log factor (or the field-weight sum Σ_f w_f m_if) plays the IDF role, and the f_i factor plays the TF role.
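The final fielded formula can be sketched by substituting the field-weight sum for the probabilistic log factor. This is an illustrative combination only; the function name, default parameters, and the assumed document layout ({'fields': {field: term set}, 'tf': {term: frequency}}) are not from the slides.

```python
def fielded_bm25(query_tf, doc, dl, avgdl, field_weights,
                 k1=1.2, k2=100.0, b=0.75):
    """R(q,D) = sum_i [ sum_f w_f * m_if ] * (k1+1)f_i/(K+f_i) * (k2+1)qf_i/(k2+qf_i)
    The field-weight sum replaces the relevance log term of plain BM25."""
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term, qf in query_tf.items():
        # Weighted field match: sum of w_f over fields where the term appears
        w = sum(wf for f, wf in field_weights.items()
                if term in doc['fields'].get(f, ()))
        f_i = doc['tf'].get(term, 0)  # term frequency in the whole document
        score += w * (k1 + 1) * f_i / (K + f_i) * (k2 + 1) * qf / (k2 + qf)
    return score
```

The field weights w_f can then be tuned from relevance feedback (search logs), which is the "learnable" goal stated above.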
Other Considerations
- This is not a formal model
- Requires user relevance feedback (search logs)
- Harder to handle real-time search queries
- How to prevent love/hate attacks on the feedback data?
Thank you