Meeting Presentation, Sept. 12

Jan 05, 2016

Duyen

Things to do since last meeting:

(1) Find out the number of drug names on the FDA website (done: the number is 6244, which is small enough for us to run a search-based crawl on Twitter).

(2) Read papers to find new ideas for query cost estimation.

**Predicting query performance

**What makes a query difficult, by David Carmel

**Learning to estimate query difficulty, SIGIR 2005 best paper

**Publications of Junghoo "John" Cho

Paper Review

Predicting query performance

This is a great paper because it introduced a new concept, the clarity score, which measures how far the query language model diverges from the collection language model. It helps us view query difficulty from a new perspective: query terms that are weak at distinguishing documents may make a query difficult.
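A minimal sketch of the clarity score, assuming both language models are given as plain dictionaries of unigram probabilities. The function name and the toy distributions are made up for illustration; the score is computed as the KL divergence, in bits, of the query model from the collection model.

```python
import math

def clarity_score(query_model, collection_model):
    """KL divergence (base 2) of the query model from the collection model.

    Both arguments map term -> probability. Terms with zero query-model
    probability contribute nothing. Assumption: every term with non-zero
    query-model probability also has non-zero collection probability.
    """
    return sum(
        p * math.log2(p / collection_model[term])
        for term, p in query_model.items()
        if p > 0
    )

# Toy, made-up distributions over a three-term vocabulary.
collection = {"drug": 0.5, "aspirin": 0.25, "the": 0.25}
focused = {"drug": 0.25, "aspirin": 0.75, "the": 0.0}
vague = dict(collection)  # identical to the collection model

print(clarity_score(focused, collection))  # positive: the query is distinctive
print(clarity_score(vague, collection))    # zero: the query looks like the collection
```

A query whose model is indistinguishable from the collection scores zero, which matches the intuition that its terms cannot separate relevant documents from the rest.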

What makes a query difficult, by David Carmel

This is a good development of the previous paper. It expands the concept of clarity score into a higher-level "distance model". Distance applies not only to query and collection, but also to query and relevant documents, relevant documents and collection, etc. What is more, the paper adopts a more reasonable distance function: the Jensen-Shannon divergence (JSD).
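For reference, a small sketch of the Jensen-Shannon divergence between two term distributions, again assuming dictionaries of probabilities (the helper names are illustrative). Unlike KL divergence it is symmetric and, with base-2 logarithms, bounded between 0 and 1, so the same function can serve for any pair among query, relevant documents and collection.

```python
import math

def kl(p, q):
    """KL divergence (base 2); assumes q[t] > 0 wherever p[t] > 0."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two term distributions."""
    terms = set(p) | set(q)
    p = {t: p.get(t, 0.0) for t in terms}
    q = {t: q.get(t, 0.0) for t in terms}
    # The mixture m is non-zero wherever p or q is, so kl() is always defined.
    m = {t: 0.5 * (p[t] + q[t]) for t in terms}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy distributions: identical models versus models with no shared terms.
same = jsd({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = jsd({"a": 1.0}, {"b": 1.0})
print(same, disjoint)
```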

Paper Review

Learning to estimate query difficulty

The paper offers a new view: sub-query coverage may also strongly affect query difficulty. To support this view, the authors provide two machine learning methods: a histogram-based estimator and a modified decision tree. The results show that a difficult query is likely to be dominated by a single sub-query.

Some Ideas

A straightforward idea from David Carmel's paper is to delete query terms so as to maximize the distance between the query and the collection. The idea is not hard to implement, but I am wondering how much improvement we can get this way.
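One way this could look is a greedy loop that drops one term at a time for as long as the distance keeps growing. This is only a sketch: `prune_query` is a hypothetical name, and `toy_distance` with its per-term weights is a made-up stand-in for a real query-to-collection distance such as the clarity score.

```python
def prune_query(terms, distance):
    """Greedily drop terms while doing so increases distance(terms)."""
    current = list(terms)
    best = distance(current)
    while len(current) > 1:
        # Every query obtainable by deleting exactly one term.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        top = max(candidates, key=distance)
        score = distance(top)
        if score <= best:
            break  # no single deletion helps any more
        current, best = top, score
    return current, best

# Made-up per-term contributions; a real distance would come from the index.
WEIGHTS = {"the": -0.25, "side": 0.5, "effects": 0.75, "of": -0.5, "aspirin": 1.5}

def toy_distance(terms):
    return sum(WEIGHTS[t] for t in terms)

pruned, score = prune_query(["the", "side", "effects", "of", "aspirin"], toy_distance)
print(pruned, score)
```

Each round evaluates the distance once per remaining term, so with a real distance function the pruning itself has a cost that would need to be weighed against the gain.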

Some Ideas

A more advanced idea is to connect it with retrieval cost. The traditional cost of retrieval is as follows:

n*(complexity of function*DF(i))

where n is the number of query terms and DF(i) is the document frequency of query term i. Thus the computing cost is easy to precompute.
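A small sketch of this estimate, under the assumption that the formula is meant as a sum over the n query terms of the scoring-function cost times each term's document frequency. The function name, the `scoring_cost` parameter and the document frequencies are illustrative.

```python
def retrieval_cost(doc_freqs, scoring_cost=1.0):
    """Estimated cost: sum over query terms of scoring_cost * DF(term).

    doc_freqs holds one document frequency per query term. Assumption:
    the scoring function has the same constant cost for every posting.
    """
    return sum(scoring_cost * df for df in doc_freqs)

# Made-up document frequencies for a three-term query.
cost = retrieval_cost([1200, 50, 30000])
print(cost)
```

Since document frequencies are already stored in the index, this estimate needs no retrieval run.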

It is also interesting to consider deleting low-IDF and low-clarity terms. This would greatly reduce the computing cost, while retrieval performance may drop only slightly or even improve.
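To illustrate the cost side only, here is a sketch that filters terms by IDF and compares the summed document frequencies before and after. The collection size, document frequencies and threshold are made-up numbers, the function names are hypothetical, and a clarity-based filter would need a retrieval model that is not shown here.

```python
import math

def idf(df, num_docs):
    """Inverse document frequency, natural log."""
    return math.log(num_docs / df)

def drop_low_idf(term_dfs, num_docs, min_idf):
    """Keep only the terms whose IDF is at least min_idf."""
    return {t: df for t, df in term_dfs.items() if idf(df, num_docs) >= min_idf}

num_docs = 1_000_000
term_dfs = {"the": 900_000, "drug": 50_000, "aspirin": 2_000}

kept = drop_low_idf(term_dfs, num_docs, min_idf=1.0)
cost_before = sum(term_dfs.values())  # postings touched by the full query
cost_after = sum(kept.values())       # postings touched after deletion
print(sorted(kept), cost_before, cost_after)
```

Because low-IDF terms are exactly the ones with long posting lists, dropping a single such term removes most of the cost in this toy example.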

Some Ideas

It is also interesting to discuss term proximity and query expansion here. In my opinion, term proximity and external query term expansion may help improve query clarity.

The additional cost of term proximity is approximately:

n*(n-1)/2*(DF1+DF2+averageTF1*averageTF2*comDoc)

The additional cost of external query term expansion is approximately:

n*(complexity of function*DF(i))+k*averageDoclength+N*(complexity of function*DF(i))

where n is the number of query terms, k is the number of top documents used for expansion, and N is the number of expansion terms.
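Both estimates can be written as small functions to compare them on concrete numbers. This is a sketch under stated assumptions: the proximity formula is applied literally with one representative pair of terms, the n*(...) and N*(...) parts of the expansion formula are read as sums over the original and expansion terms, and all names and numbers are illustrative.

```python
def proximity_cost(n, df1, df2, avg_tf1, avg_tf2, common_docs):
    """n*(n-1)/2 term pairs, each costing DF1 + DF2 + avgTF1*avgTF2*comDoc."""
    pairs = n * (n - 1) / 2
    return pairs * (df1 + df2 + avg_tf1 * avg_tf2 * common_docs)

def expansion_cost(query_dfs, k, avg_doc_length, expansion_dfs, scoring_cost=1.0):
    """First retrieval + reading the top k documents + scoring the expansion terms."""
    first_pass = sum(scoring_cost * df for df in query_dfs)
    read_top_docs = k * avg_doc_length
    second_pass = sum(scoring_cost * df for df in expansion_dfs)
    return first_pass + read_top_docs + second_pass

# Made-up statistics for a three-term query.
prox = proximity_cost(n=3, df1=1000, df2=2000, avg_tf1=2, avg_tf2=3, common_docs=100)
expn = expansion_cost([1000, 2000, 500], k=10, avg_doc_length=200,
                      expansion_dfs=[300, 700])
print(prox, expn)
```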

It will be interesting to discuss how much clarity term proximity and external query term expansion can add.