Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google, Inc. (SIGIR 2008) Speaker: Chin-Wei Cho Advisor: Dr. Jia-Ling Koh Date: 2009/1/8

Jan 02, 2016

Transcript
Page 1: Retrieval Models for Question and Answer Archives

Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft

Computer Science Department University of Massachusetts,

Google, Inc.

(SIGIR 2008)

Speaker: Chin-Wei Cho

Advisor: Dr. Jia-Ling Koh

Date: 2009/1/8

Page 2: Outline

Introduction

Query likelihood language model vs. IBM translation model

The retrieval model for Question and Answer Archives

Learning word-to-word translation probabilities

Experiments

Conclusion and Future work

Page 3: Introduction

Question and Answer (Q&A) archives have become an important information resource on the Web (e.g., Yahoo! Answers, Live QnA)

The retrieval task in a Q&A archive is to find relevant question-answer pairs for new questions posed by the user

Page 4: Introduction

Advantages of Q&A retrieval over Web search

User can use natural language instead of only keywords as a query, and thus can potentially express his/her information need more clearly

System returns several possible answers directly instead of a long list of ranked documents, and can therefore increase the efficiency of finding the required answers

Q&A retrieval can also be considered as an alternative solution to the general Question Answering (QA) problem. Since the answers for each question in the Q&A archive are generated by humans, the difficult QA task of extracting a correct answer is transformed to the Q&A retrieval task.

Page 5: Introduction

Challenge for Q&A retrieval

Word mismatch between the user’s question and the question-answer pairs in the archive

“What is Steve Jobs best known for?” and “Who is the CEO of Apple Inc?” are similar questions but share no content words

We focus on translation-based approaches since the relationships between words can be explicitly modeled through word-to-word translation probabilities

Page 6: Introduction

Design the translation-based retrieval model

IBM translation model 1

Query likelihood language model

Learn good word-to-word translation probabilities

The asker and the answerer may express similar meanings with different words

Use the question-answer pairs as the “parallel corpus”

Source : Target => Q:A or A:Q or Both?

Page 7: Introduction

For the question part, the query is generated by our proposed translation-based language model

For the answer part, the query is simply generated by the query likelihood language model

Our final model for Q&A retrieval is a combination of the above models.

Page 8: Query likelihood language model vs. IBM model

q is the query

D is the document

C is the background collection

λ is the smoothing parameter

|D| and |C| are the lengths of D and C

#(t,D) denotes the frequency of term t in D

P(w|null) is the probability that the term w is translated (generated) from the null term

P(w|t) is the translation probability from word t to word w
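The formulas that these symbols belong to did not survive in the transcript. The block below is a reconstruction, not a copy of the slide: it assumes the standard smoothed query likelihood form and the IBM model 1 form in which the null term takes the place of the background collection with a fixed weight of 1, which is consistent with the "λ vs. 1" and "Pml(w|D) vs. Ptr(w|D)" comparison on the next page.

```latex
% Query likelihood language model (reconstruction, assumed form)
\begin{align*}
P(q \mid D) &= \prod_{w \in q} P(w \mid D) \\
P(w \mid D) &= \frac{|D|}{|D| + \lambda}\, P_{ml}(w \mid D)
             + \frac{\lambda}{|D| + \lambda}\, P_{ml}(w \mid C) \\
P_{ml}(w \mid D) &= \frac{\#(w, D)}{|D|}, \qquad
P_{ml}(w \mid C) = \frac{\#(w, C)}{|C|}
\end{align*}

% IBM translation model 1 (reconstruction, assumed form)
\begin{align*}
P(q \mid D) &= \prod_{w \in q} P(w \mid D) \\
P(w \mid D) &= \frac{|D|}{|D| + 1}\, P_{tr}(w \mid D)
             + \frac{1}{|D| + 1}\, P(w \mid \text{null}) \\
P_{tr}(w \mid D) &= \sum_{t \in D} P(w \mid t)\, P_{ml}(t \mid D)
\end{align*}
```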

Page 9: Query likelihood language model vs. IBM model

Pml(w|C) vs. P(w|null)

Query likelihood: the background distribution generates common terms that connect content words

IBM: the null term generates spurious terms in the target sentence, which is a little awkward and less stable

λ vs. 1

The lack of a mechanism to control background smoothing in the IBM model leads to poor performance

Pml(w|D) vs. Ptr(w|D)

Query likelihood: uses the maximum likelihood estimator, which gives zero probability to words that do not appear in the document

IBM: every word in the document has some probability of being translated into a target word, and these probabilities are added up to calculate the sampling probability
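The contrast between the two estimators can be made concrete with a small Python sketch. It is illustrative only: the function names, the toy document, the translation table `trans` and the null-term table `p_null` are all made up for this example, and the formulas follow the assumed forms (background weight λ for query likelihood, fixed weight 1 for IBM model 1).

```python
from collections import Counter


def p_ml(word, tokens):
    """Maximum likelihood estimate #(word, tokens) / |tokens|."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def p_query_likelihood(word, doc, collection, lam):
    """Smooth P_ml(w|D) with the background P_ml(w|C); lam controls the mix."""
    d = len(doc)
    return d / (d + lam) * p_ml(word, doc) + lam / (d + lam) * p_ml(word, collection)


def p_ibm1(word, doc, trans, p_null):
    """Sum P(w|t) * P_ml(t|D) over document terms; the null term has fixed weight 1."""
    d = len(doc)
    p_tr = sum(trans.get((word, t), 0.0) * p_ml(t, doc) for t in set(doc))
    return d / (d + 1) * p_tr + 1 / (d + 1) * p_null.get(word, 0.0)


# Toy data (hypothetical values, chosen only to show the mechanics).
doc = ["ceo", "apple", "inc"]
collection = doc + ["steve", "jobs", "apple"]
trans = {("jobs", "ceo"): 0.2, ("jobs", "apple"): 0.3}  # keys are (target, source)
p_null = {"jobs": 0.01}

ql = p_query_likelihood("jobs", doc, collection, lam=1.0)
ibm = p_ibm1("jobs", doc, trans, p_null)
print(ql, ibm)
```

The word "jobs" does not occur in the document, so under query likelihood it is supported only by the background collection, while under the IBM model it also receives mass from "ceo" and "apple" through the translation table.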

Page 10: Query likelihood language model vs. IBM model

However, we cannot simply choose the sampling method used in the IBM model because of the self-translation problem. Since the target and the source languages are the same, every word has some probability to translate into itself.

Low self-translation probabilities reduce retrieval performance by giving very low weights to the matching terms.

Very high self-translation probabilities do not exploit the merits of the translation approach.

Page 11: The retrieval model

Our Final Translation-Based Language Model for the Question Part

C denotes the whole archive, C = {(q, a)1, (q, a)2, ..., (q, a)L}.

Q denotes the set of all questions in C, Q = {q1, q2, ..., qM}.

A denotes the set of all answers in C, A = {a1, a2, ..., aN}.

Given the user question q, the task of Q&A retrieval is to rank (q, a)i according to score(q, (q, a)i).

Page 12: The retrieval model

Linearly mix two different estimations: maximum likelihood estimation and translation-based estimation

Query Likelihood Language Model for the Answer Part
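A minimal Python sketch of how such a mixture could score a question-answer pair is given below. It rests on several assumptions that are not stated on the slide: the mixture weights are named `beta` (translation part) and `gamma` (answer part) with the remainder going to the maximum likelihood estimate of the question part, the length of a pair is taken as the number of question tokens plus answer tokens, and the toy archive and translation table are invented.

```python
from collections import Counter


def p_ml(word, tokens):
    """Maximum likelihood estimate #(word, tokens) / |tokens|."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def p_mx(word, question, answer, trans, beta, gamma):
    """Mix ML on the question, translation from the question, and ML on the answer."""
    alpha = 1.0 - beta - gamma
    p_tr = sum(trans.get((word, t), 0.0) * p_ml(t, question) for t in set(question))
    return alpha * p_ml(word, question) + beta * p_tr + gamma * p_ml(word, answer)


def score(query, question, answer, collection, trans, lam=1.0, beta=0.3, gamma=0.2):
    """Product over query words of the mixture smoothed with the background collection."""
    n = len(question) + len(answer)  # assumed definition of the pair length
    result = 1.0
    for w in query:
        mixed = p_mx(w, question, answer, trans, beta, gamma)
        result *= n / (n + lam) * mixed + lam / (n + lam) * p_ml(w, collection)
    return result


# Toy archive (hypothetical data).
archive = [
    (["who", "is", "the", "ceo", "of", "apple"], ["steve", "jobs"]),
    (["how", "to", "bake", "bread"], ["use", "yeast"]),
]
collection = [tok for q, a in archive for tok in q + a]
trans = {("jobs", "ceo"): 0.4}  # keys are (target, source)
query = ["jobs", "apple"]

scores = [score(query, q, a, collection, trans) for q, a in archive]
print(scores)
```

Setting `beta` to zero removes the translation component and leaves a word-level combination of the question and answer language models; setting `gamma` to zero scores on the question part alone.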

Page 13: Learning word-to-word translation probabilities

In a Q&A archive, question-answer pairs can be considered as a type of parallel corpus, which is used for estimating word-to-word translation probabilities

In IBM translation model 1, English is the source language and French is the target language

Since the questions and answers in a Q&A archive are written in the same language, the word-to-word translation probability can be calculated through setting either as the source and the other as the target

Page 14: Learning word-to-word translation probabilities

P(A|Q) is used to denote the word-to-word translation probability with the question as the source and the answer as the target

P(Q|A) is used to denote the opposite configuration

EX: Question: “cheat” => Answer: “trust”, “forgive”, “dump”, “leave”

Answer: “cheat” => Question: “husband”, “boyfriend”
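As a rough illustration of how such probabilities are learned, the Python sketch below runs the expectation-maximization procedure of IBM model 1 on a few toy pairs with the question as source and the answer as target. It is a simplification: the null term is omitted, the function name `train_ibm1` and the toy pairs are made up, and nothing here reflects the actual training setup of the paper.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t[(target, source)] = P(target | source) with IBM model 1 EM.

    pairs is a list of (source_tokens, target_tokens); the null term is omitted.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its pair.
        for src, tgt in pairs:
            for w in tgt:
                z = sum(t[(w, s)] for s in src)
                for s in src:
                    c = t[(w, s)] / z
                    count[(w, s)] += c
                    total[s] += c
        # M-step: renormalise the expected counts per source word.
        t = {(w, s): c / total[s] for (w, s), c in count.items()}
    return dict(t)


# Toy question -> answer pairs (hypothetical), i.e. the P(A|Q) configuration.
pairs = [
    (["cheat"], ["trust", "forgive"]),
    (["cheat", "husband"], ["forgive", "leave"]),
    (["husband"], ["leave"]),
]
p_a_given_q = train_ibm1(pairs)
print(sorted(p_a_given_q.items()))
```

Swapping the two sides of every pair before training gives the opposite configuration, P(Q|A).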

Page 15: Learning word-to-word translation probabilities

w2 should be more similar to w1 than w3 is. This intuition will be considered implicitly by combining P(Q|A) and P(A|Q), since P(w2|w1) will get contributions from both P(Q|A) and P(A|Q), but P(w3|w1) only gets the contribution from P(A|Q).

[Figure: two Q-A pairs. In the first, w1 appears in the question (Q) and w2 and w3 in the answer (A); in the second, w2 appears in the question and w1 in the answer.]

Page 16: Learning word-to-word translation probabilities

Combine P(Q|A) and P(A|Q) instead of choosing just one of them

Plin: linearly combines P(Q|A) and P(A|Q)

Ppool: pools the Q-A pairs used for learning P(A|Q) and the A-Q pairs used for learning P(Q|A) together, and learns the combined word-to-word translation probabilities
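The two combination strategies (referred to as Plin and Ppool in the experiments) can be sketched in Python as follows. The weight name `delta`, the function names and the toy tables are assumptions made for this illustration; the tables reuse the w1, w2, w3 example, where w2 is supported by both configurations and w3 by only one.

```python
def combine_linear(p_a_given_q, p_q_given_a, delta=0.5):
    """Linear combination: (1 - delta) * P(A|Q) + delta * P(Q|A), entry by entry."""
    keys = set(p_a_given_q) | set(p_q_given_a)
    return {
        k: (1 - delta) * p_a_given_q.get(k, 0.0) + delta * p_q_given_a.get(k, 0.0)
        for k in keys
    }


def pool_pairs(qa_pairs):
    """Pooling: train one model on the Q->A pairs plus the reversed A->Q pairs."""
    return [(q, a) for q, a in qa_pairs] + [(a, q) for q, a in qa_pairs]


# Toy tables keyed by (target, source); values are hypothetical.
p_a_given_q = {("w2", "w1"): 0.5, ("w3", "w1"): 0.5}
p_q_given_a = {("w2", "w1"): 1.0}

p_lin = combine_linear(p_a_given_q, p_q_given_a, delta=0.5)
pooled = pool_pairs([(["w1"], ["w2", "w3"]), (["w2"], ["w1"])])
print(p_lin, len(pooled))
```

With these toy values the combined table prefers w2 over w3 as a translation of w1, because w2 receives contributions from both configurations. The pooled list would then be passed to a single translation-model trainer.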

Page 17: Experiments

The Wondir collection: 1 million Q-A pairs

Topics for questions are very diverse, ranging from restaurant recommendations to rocket science

The average lengths of the question part and the answer part are 27 words and 28 words, respectively

Spelling errors are very common in this collection, which makes the word mismatch problem very serious.

50 questions from the TREC-9 QA track are used for testing

Page 18: Experiments

Since the relevance of the answer to its corresponding question is usually guaranteed, the retrieval performance of a system can be measured by the rank of relevant questions it returns.

Ranking algorithms first output question-answer pair ranks that are then transformed into question ranks.
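One plausible reading of this transformation, sketched in Python below, is that a question may appear in several ranked pairs and is kept only at its best position. That reading, the function name and the toy ranking are assumptions; the slide does not spell out the procedure.

```python
def question_ranking(pair_ranking):
    """Collapse a ranked list of (question, answer) pairs into a ranked list of
    distinct questions, keeping each question at its first (best) position."""
    seen = set()
    ranked_questions = []
    for question, _answer in pair_ranking:
        if question not in seen:
            seen.add(question)
            ranked_questions.append(question)
    return ranked_questions


# Toy ranked output (hypothetical identifiers).
pair_ranking = [("q1", "a1"), ("q2", "a2"), ("q1", "a3"), ("q3", "a4")]
questions = question_ranking(pair_ranking)
print(questions)
```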

Page 19: Experiments-1

Show the importance of the question part and the answer part for Q&A retrieval.

The query likelihood retrieval model was used with the question parts, the answer parts, and the question-answer pairs

Page 20: Experiments-2 Compare

Three types of baselines:

Type I: Query Likelihood Language Model (LM), Okapi BM25 (Okapi) and Relevance Model (RM). This type of baseline represents state-of-the-art retrieval models

Type II: The combination model, which combines the language models estimated from the question part and the answer part at the word level (LM-Comb). This model is equivalent to setting β to zero.

Type III: Other translation-based models. This type of baseline represents previous work on translation-based language models.

Page 21: Experiments-2 Compare

The TransLM model performs better than the state-of-the-art retrieval systems

P(A|Q) is more effective than P(Q|A), which can be explained as the question source being more important than the answer source for generating the user question.

Page 22: Experiments-3

Page 23: Experiments-4

Compares the effect of Plin and Ppool with P(A|Q) and P(Q|A) when used with TransLM

Page 24: Experiments-5

Page 25: Experiments-6

TransLM+QL: our retrieval model for question-answer pairs that incorporates the answer part.

Compares TransLM+QL with TransLM and LM-Comb. Ppool is used as the method for estimating translation probabilities.

Page 26: Experiments

Page 27: Conclusion and Future work

Q&A retrieval has become an important issue due to the popularity of Q&A archives on the web. In this paper, we propose a novel translation-based language model to solve this problem.

The final model combines the translation-based language model estimated using the question part and the query likelihood language model estimated using the answer part.

Different configurations of question-answer pairs are used to improve the quality of the learned word-to-word translation probabilities.

Phrase-based machine translation models have shown superior performance compared to word-based translation models in translation applications. We plan to study the effectiveness of these models in the Q&A setting.