Noname manuscript No. (will be inserted by the editor)

LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval
Tao Qin · Tie-Yan Liu · Jun Xu · Hang Li
Received: date / Accepted: date
Abstract LETOR is a benchmark collection for research on learning to rank for
information retrieval, released by Microsoft Research Asia. In this paper, we describe
the details of the LETOR collection and show how it can be used in different kinds
of research. Specifically, we describe how the document corpora and query sets in
LETOR were selected, how the documents were sampled, how the learning features and
meta information were extracted, and how the datasets were partitioned for comprehensive
evaluation. We then compare several state-of-the-art learning to rank algorithms on
LETOR, report their ranking performances, and discuss the results. After
that, we discuss possible new research topics that can be supported by LETOR, in
addition to algorithm comparison. We hope that this paper can help people gain a
deeper understanding of LETOR, and enable more interesting research projects on
learning to rank and related topics.
Keywords Learning to rank · information retrieval · benchmark datasets · feature extraction
1 Introduction
Ranking is the central problem for many applications of information retrieval (IR).
These include document retrieval [5], collaborative filtering [16], key term extraction [9], definition finding [46], important email routing [8], sentiment analysis [29], product rating [12], and anti web spam [15].
[Feature list table, continued:]
41  BM25 of ‘title + abstract’        Q-D
42  log(BM25) of ‘title + abstract’   Q-D
43  LMIR.DIR of ‘title + abstract’    Q-D
44  LMIR.JM of ‘title + abstract’     Q-D
45  LMIR.ABS of ‘title + abstract’    Q-D
well on most datasets. Among the four listwise ranking algorithms, ListNet seems to be
better than the others. AdaRank-MAP, AdaRank-NDCG and SVMMAP obtain similar
performances. Pairwise ranking algorithms achieve good ranking accuracy on some (but
not all) datasets. For example, RankBoost offers the best performance on TD2004 and
NP2003; Ranking SVM shows very promising results on NP2003 and NP2004; and
FRank achieves very good results on TD2004 and NP2004. In contrast, simple linear
regression performs worse than the pairwise and listwise ranking algorithms on most
datasets.
We observe that most ranking algorithms perform differently on different datasets.
They may perform very well on some datasets but not so well on the other datasets.
To evaluate the overall ranking performance of an algorithm, we used the number of
other algorithms that it can beat over all the seven datasets as a measure. That is,
S_i(M) = \sum_{j=1}^{7} \sum_{k=1}^{8} 1_{\{M_i(j) > M_k(j)\}},

where j is the index of a dataset, i and k are indices of algorithms, M_i(j) is the
performance of the i-th algorithm on the j-th dataset in terms of measure M (such as MAP),
and 1_{\{M_i(j) > M_k(j)\}} is the indicator function

1_{\{M_i(j) > M_k(j)\}} = \begin{cases} 1 & \text{if } M_i(j) > M_k(j), \\ 0 & \text{otherwise.} \end{cases}
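As a concrete illustration, the winning number can be computed with a few lines of code. The sketch below is our illustration, not part of the LETOR package; it assumes a hypothetical matrix perf of shape (8 algorithms, 7 datasets) whose entries are the values of a measure M (such as MAP).

```python
import numpy as np

# Hypothetical performance matrix: perf[i, j] is the value of measure M
# (e.g., MAP) achieved by the i-th algorithm on the j-th dataset.
rng = np.random.default_rng(0)
perf = rng.random((8, 7))

def winning_number(perf):
    """S_i(M) = sum over j and k of 1{M_i(j) > M_k(j)}."""
    n_algos, _ = perf.shape
    s = np.zeros(n_algos, dtype=int)
    for i in range(n_algos):
        # Compare algorithm i against every algorithm k on every dataset j;
        # broadcasting counts all (j, k) pairs where i wins.
        s[i] = int(np.sum(perf[i, :][None, :] > perf))
    return s

print(winning_number(perf))
```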
It is clear that the larger Si(M) is, the better the i-th algorithm performs. For ease
of reference, we call this measure winning number. Figure 4 shows the winning number
in terms of NDCG for all the algorithms under investigation. From this figure, we have
the following observations.⁴
(1) In terms of NDCG@1, among the four listwise ranking algorithms, ListNet is better
than AdaRank-MAP and AdaRank-NDCG, while SVMMAP performs a little worse
than the others. The three pairwise ranking algorithms achieve comparable results,
among which Ranking SVM seems to be slightly better than the other two. Overall,
the listwise algorithms seem to perform better than the pointwise and pairwise
algorithms.
(2) In terms of NDCG@3, ListNet and AdaRank-MAP perform much better than the
other algorithms, while the performances of Ranking SVM, RankBoost, AdaRank-
NDCG, and SVMMAP are very similar to each other.
(3) For NDCG@10, one can get similar conclusions to those for NDCG@3.
Comparing NDCG@1, NDCG@3, and NDCG@10, it seems that the listwise ranking
algorithms have certain advantages over the other algorithms at the top positions
(e.g., position 1) of the ranking results. Here we give a possible explanation. Because
the loss functions of listwise algorithms are defined on all the documents of a query,
they can consider all the documents together and make use of their position information.
In contrast, the loss functions of the pointwise and pairwise algorithms are defined on
a single document or a document pair; they cannot access the scores of all the documents
at the same time and cannot see the position of each document. Since most IR measures
(such as MAP and NDCG) are position based, listwise algorithms, which can see the
position information in their loss functions, should perform better than pointwise and
pairwise algorithms, which cannot.
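To make this contrast concrete, the following minimal sketch (our illustration, not the exact loss implementations of the baseline algorithms) places a pairwise hinge loss, which sees only one document pair at a time, next to a ListNet-style top-one cross-entropy loss, which couples all documents of a query through a softmax.

```python
import numpy as np

def pairwise_hinge_loss(scores, labels):
    # Defined on document pairs: each term sees only two documents,
    # with no access to list-level position information.
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += max(0.0, 1.0 - (scores[i] - scores[j]))
    return loss

def listnet_top_one_loss(scores, labels):
    # Defined on the whole document list of a query: the softmax couples
    # every document's score with all the others.
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    p_true = softmax(np.asarray(labels, dtype=float))
    p_pred = softmax(np.asarray(scores, dtype=float))
    return -float(np.sum(p_true * np.log(p_pred + 1e-12)))
```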
Figure 5 shows the winning number in terms of Precision and MAP. We have the
following observations from the figure.
⁴ These observations are based on the results on the LETOR website when the paper was submitted. The website is continuously updated to incorporate the active contributions from the IR community. For the latest status of the algorithms, please refer to the LETOR website: http://research.microsoft.com/~letor.
(1) In terms of P@1, among the four listwise ranking algorithms, ListNet is better
than AdaRank-NDCG, while AdaRank-MAP and SVMMAP perform worse than
AdaRank-NDCG. The three pairwise ranking algorithms achieve comparable re-
sults, among which Ranking SVM seems to be slightly better. Overall, the listwise
algorithms seem to perform better than the pointwise and pairwise algorithms.
(2) For P@3, one can get similar conclusions to those for P@1.
(3) In terms of P@10, ListNet performs much better than all the other algorithms; the
performances of Ranking SVM, RankBoost and FRank are better than AdaRank-
MAP, AdaRank-NDCG, and SVMMAP.
(4) In terms of MAP, ListNet is the best one; Ranking SVM, AdaRank-MAP, and
SVMMAP achieve similar results, and are better than the remaining algorithms.
Furthermore, in terms of MAP the variance among the three pairwise ranking algorithms
is much larger than in terms of the other measures (P@1, P@3, and P@10). A possible
explanation is as follows: since MAP involves all the documents associated with a
query in the evaluation process, it can better differentiate algorithms (see the sketch below).
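For reference, here is a minimal sketch of P@k and average precision on binary relevance labels (our illustration, not LETOR's evaluation scripts); it makes visible that average precision touches the position of every relevant document in the list, while P@k only inspects the top k. MAP is then the mean of average precision over all queries.

```python
import numpy as np

def precision_at_k(ranked_labels, k):
    # ranked_labels: binary relevance of documents, in ranked order.
    return float(np.mean(np.asarray(ranked_labels)[:k]))

def average_precision(ranked_labels):
    # AP averages P@k over the positions k of all relevant documents,
    # so every relevant document in the list influences the score.
    ranked_labels = np.asarray(ranked_labels)
    rel_positions = np.where(ranked_labels == 1)[0]
    if len(rel_positions) == 0:
        return 0.0
    precisions = [ranked_labels[:pos + 1].mean() for pos in rel_positions]
    return float(np.mean(precisions))
```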
To summarize, the experimental results show that the listwise algorithms (ListNet,
AdaRank-MAP, AdaRank-NDCG, and SVMMAP) have certain advantages over other
algorithms, especially for the top positions of the ranking results.
Note that the above experimental results are in some sense still preliminary, since
the result of almost every algorithm can be further improved. For example, for regression,
we can add a regularization term to make it more robust; for Ranking SVM, if the
time complexity is not an issue, we can remove the '-# 5000' constraint (a cap on the
number of training iterations) to achieve better convergence of the algorithm; for ListNet,
we can also add a regularization term to its loss function and make it generalize better
to the test set. Considering these
issues, we would like to call for contributions from the research community. Researchers
are encouraged to submit the results of their newly developed algorithms as well as
their carefully tuned existing algorithms to LETOR. In order to let others reproduce
the submitted results, contributors are kindly asked to prepare a package for
the algorithm, including
(1) a brief document introducing the algorithm;
(2) an executable file of the algorithm;
(3) a script to run the algorithm on the seven datasets of LETOR.
We believe that, with the collaborative efforts of the entire community, we can build
more versatile and reliable baselines on LETOR, and better facilitate the research on
learning to rank.
5 Supporting New Research Directions
So far LETOR has mainly been used as an experimental platform to compare different
algorithms. In this section, we show that LETOR can also be used to support many
new research directions.
5.1 Ranking Models
Most of the previous work (as reviewed in Section 2) focuses on developing better loss
functions, and simply uses a scoring function as the ranking model.

[Fig. 4 Comparison across the seven datasets by NDCG: winning numbers in terms of NDCG@1, NDCG@3, and NDCG@10 for Regression, RankSVM, RankBoost, FRank, ListNet, AdaRank-MAP, AdaRank-NDCG, and SVMMAP.]

[Fig. 5 Comparison across the seven datasets by Precision and MAP: winning numbers in terms of P@1, P@3, P@10, and MAP for the same eight algorithms.]

Investigating new ranking models may be one of the major topics at the next step. For example, one
can study new algorithms using pairwise and listwise ranking functions. Note, however,
the challenge of using a pairwise/listwise ranking function: the test complexity
will be much higher than that of using a scoring function. One should pay attention to
this issue when performing research on pairwise and listwise ranking functions.
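A minimal sketch of the complexity gap follows (our illustration; the preference function g and the Borda-style aggregation are assumptions, not methods from this paper): ranking with a scoring function needs n evaluations plus a sort, while a pairwise preference function needs on the order of n² evaluations before an order can even be aggregated.

```python
def rank_with_scoring_function(docs, f):
    # n evaluations of f, then a sort: the standard setting.
    return sorted(docs, key=f, reverse=True)

def rank_with_pairwise_function(docs, g):
    # g(a, b) is truthy if a should precede b. All n*(n-1) ordered pairs
    # must be evaluated before an order can be aggregated (here, a simple
    # Borda-count style aggregation as one option).
    wins = {id(d): 0 for d in docs}
    for a in docs:
        for b in docs:
            if a is not b and g(a, b):
                wins[id(a)] += 1
    return sorted(docs, key=lambda d: wins[id(d)], reverse=True)
```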
5.2 Feature Engineering
Features are, by all means, very important for learning to rank algorithms. Since LETOR
contains rich meta information, it can be used to study feature-related problems.
Feature Extraction
The performance of a ranking algorithm greatly depends on the effectiveness of the
features used. LETOR contains low-level information such as term frequency and
document length. It also contains rich meta information about the corpora and the
documents. These can be used to derive new features, and study their contributions to
ranking.
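For instance, a BM25-style feature can be recomputed, or varied, from the low-level statistics shipped with LETOR. The sketch below uses the common k1 and b defaults and is an illustration of the idea, not the exact formula used to produce the released features.

```python
import math

def bm25_term(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    # One way to derive a BM25-style feature from low-level statistics:
    # term frequency, document frequency, and document length.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    denom = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / denom

# A query-document feature is the sum over the query terms, e.g.:
# score = sum(bm25_term(tf[t], df[t], dl, avgdl, N) for t in query_terms)
```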
Feature Selection
Feature selection has been extensively studied for classification. However, as far
as we know, work on feature selection for ranking is still very limited.
LETOR contains tens of standard features, and it is feasible to use LETOR to study
the selection of the most effective features for ranking.
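As one possible starting point, a greedy forward-selection loop over the LETOR features might look as follows; evaluate is a placeholder for training a ranker on a feature subset and returning a validation score such as MAP.

```python
def greedy_feature_selection(features, evaluate, budget):
    # features: list of feature ids; evaluate(subset) -> validation score.
    # Both are assumptions of this sketch, not part of the LETOR package.
    selected = []
    while len(selected) < budget:
        best_f, best_score = None, float("-inf")
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_f, best_score = f, score
        if best_f is None:
            break
        selected.append(best_f)
    return selected
```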
Dimensionality Reduction
Different from feature selection, dimensionality reduction tries to reduce the number
of features by transforming/combining the original features. Dimensionality reduction
has been shown to be very effective in many applications, such as face detection and signal
processing. As with feature selection, little work has been done on dimensionality
reduction for ranking. It is an important research topic, and LETOR can be used to
support such research.
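As a simple baseline for such a study, one could project the standard LETOR feature vectors onto their top principal components; the sketch below (our illustration) uses an SVD-based PCA.

```python
import numpy as np

def pca_transform(X, n_components):
    # X: (n_query_document_pairs, n_features) feature matrix.
    # Project the original features onto their top principal directions.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

# Example: reduce a hypothetical 45-feature matrix to 10 dimensions.
X_reduced = pca_transform(np.random.rand(1000, 45), n_components=10)
```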
5.3 New Ranking Scenarios
LETOR also offers opportunities to investigate new ranking scenarios. Here
we give several examples.
Query Classification and Query Dependent Ranking
In most previous work, a single ranking function is used to handle all kinds of
queries. This may not be appropriate, particularly for web search. Queries in web search
may vary widely in semantics and user intention. Using a single model alone would
force compromises among queries and result in lower accuracy in relevance ranking.
Instead, it would be better to exploit different ranking models for different queries.
Since LETOR contains several different kinds of query sets (such as topic distillation,
homepage finding, and named page finding) and rich information about queries, it is
possible to study the problems of query classification and query dependent ranking
[14].
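A query-dependent ranking pipeline could be organized as in the sketch below, where classify and the per-type rankers are placeholders for models trained on the corresponding LETOR query sets.

```python
def query_dependent_rank(query, docs, classify, rankers):
    # classify(query) -> a query type such as 'topic distillation',
    # 'homepage finding', or 'named page finding'; rankers maps each type
    # to a model trained on queries of that type. Both are placeholders.
    query_type = classify(query)
    model = rankers[query_type]
    return sorted(docs, key=lambda d: model.score(query, d), reverse=True)
```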
Beyond Independent Ranking
Existing technologies on learning to rank assume that the relevance of a document
is independent of the relevance of other documents. The assumption makes it possible
to score each document independently first and sort the documents according to their
scores after that. In reality, the assumption may not always hold. There are many
retrieval applications in which documents are not independent and relation information
among documents can be or must be exploited. For example, Web pages from the same
site form a sitemap hierarchy. If both a page and its parent page are about the topic
of the query, then it would be better to rank the parent page higher for this query. As
another example, similarities between documents are available, and we can leverage
the information to enhance relevance ranking. Other problems, like subtopic retrieval
[52], also need to utilize relation information. LETOR contains rich relation information,
including hyperlink graph, similarity matrix, and sitemap hierarchy, and therefore can
well support the research on dependent ranking.
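As one simple illustration of using the relation information, the sketch below smooths the relevance scores of a query's documents with the released similarity matrix; this score-propagation heuristic is our example, not a method evaluated in this paper.

```python
import numpy as np

def rerank_with_similarity(scores, sim, alpha=0.8):
    # scores: (n_docs,) relevance scores for one query;
    # sim: (n_docs, n_docs) document similarity matrix.
    # Smooth each document's score with the scores of similar documents.
    sim = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)  # row-normalize
    return alpha * scores + (1.0 - alpha) * sim @ scores
```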
Multitask Ranking and Transfer Ranking
Multitask learning aims at learning several related tasks at the same time, and the
learning of the tasks can benefit from each other. In other words, the information
provided by the training signal for each task serves as a domain-specific inductive bias
for the other tasks. Transfer learning uses the data in one or more auxiliary domains
to help the learning task in the main domain. Because LETOR contains seven query
sets and three different retrieval tasks, it is a good test bed for multitask ranking and
transfer ranking.
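One simple way to set up such an experiment on LETOR is a linear ranker whose weights decompose into a part shared across tasks and a task-specific part; the parameterization below is a hypothetical sketch, not a baseline from the paper.

```python
import numpy as np

class MultitaskLinearRanker:
    # Hypothetical parameterization: a weight vector shared across all
    # retrieval tasks plus a small task-specific correction, so that tasks
    # (e.g., the seven LETOR query sets) can borrow strength from each
    # other through the shared part.
    def __init__(self, n_features, tasks):
        self.w_shared = np.zeros(n_features)
        self.w_task = {t: np.zeros(n_features) for t in tasks}

    def score(self, x, task):
        return x @ (self.w_shared + self.w_task[task])
```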
To summarize, although the current use of LETOR is mostly about algorithm
comparison, LETOR can actually support a much richer research agenda. We
hope that more and more interesting research can be carried out with the help of
LETOR, and that the state of the art of learning to rank can be significantly advanced.
6 Limitations
Although LETOR has been widely used, it has certain limitations as listed below.
Document Sampling Strategy
For the “Gov” datasets, the retrieval problem is essentially cast as a re-ranking
task (for top 1000 documents) in LETOR. On one hand, this is a common practice
for real-world Web search engines. Usually two rankers are used by a search engine for
the sake of efficiency: first, a simple ranker (e.g., BM25) is used to select candidate
documents, and then a more complex ranker (e.g., one of the learning to rank algorithms
mentioned in this paper) is used to produce the final ranking result. On the other
hand, however, there are also some retrieval applications that should not be cast as
a re-ranking task. We will add datasets beyond re-ranking settings to LETOR in the
future.
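The two-ranker practice described above can be summarized in a short sketch (our illustration; simple_score and complex_score are placeholders for, e.g., BM25 and a learned model):

```python
def two_stage_rank(query, index, simple_score, complex_score, cutoff=1000):
    # Stage 1: a cheap ranker selects the top `cutoff` candidates from the
    # full index; stage 2: a more expensive learned ranker re-ranks only
    # those candidates, as in the LETOR "Gov" re-ranking setting.
    candidates = sorted(index, key=lambda d: simple_score(query, d),
                        reverse=True)[:cutoff]
    return sorted(candidates, key=lambda d: complex_score(query, d),
                  reverse=True)
```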
For the “Gov” datasets, we sampled documents for each query using a cutoff number
of 1000. We will study the impact of the cutoff number on the performances of the
ranking algorithms. It is possible that the dataset should be refined using a better
cutoff number.
Features
In both academic and industrial communities, more and more features have been
studied and applied to improve ranking accuracy. The feature list provided in LETOR
is far from comprehensive. For example, document features (such as document
length) are not included in the OHSUMED dataset, and proximity features are not
included in any of the seven datasets. We will add more features to the LETOR datasets
in the future.
Scale and Diversity of Datasets
Compared with real-world web search, the scale (number of queries) of the datasets
in LETOR is not yet very large. To verify the performance of learning to rank
techniques for real-world web search, large-scale datasets are needed. We are working
on some large-scale datasets and plan to release them in future versions of LETOR.
Although there are seven query sets in LETOR 3.0, only two document
corpora are involved. We will create new datasets using more document corpora in the
future.
Baselines
Most baseline algorithms in LETOR use linear ranking functions. From Section 4.3,
we can see that the performances of these algorithms are not good enough, since a
perfect ranking would achieve an accuracy of 1 in terms of all the measures (P@k,
MAP and NDCG). As pointed out in Section 3.3, class-Q features cannot be effectively
used by linear ranking functions. We will add more algorithms with nonlinear ranking
functions as baselines of LETOR. We also encourage researchers in the community to
test more non-linear ranking models.
7 Conclusions
By explaining the data creation process and the results of state-of-the-art learning
to rank algorithms in this paper, we have provided the information for people to better
understand the nature of LETOR and to more effectively utilize the datasets in their
research work.
We have received a lot of comments and feedback about LETOR after its first
release. We hope we can get more suggestions from the research community. We also
encourage researchers to contribute to LETOR by submitting their results.
Finally, we expect that LETOR is just a start for the research community's effort to
build benchmark datasets for learning to rank. With more and more such efforts, the
research on learning to rank for IR can be significantly advanced.
Acknowledgements We would like to thank Chao-Liang Zhong, Kang Ji and Wenying Xiong for their work on the creation of the LETOR datasets, and thank Da Kuang, Chao-Liang Zhong, Yong-Deok Kim, Ming-Feng Tsai, Yi-Song Yue, Olivier Chapelle and Thorsten Joachims for their work on the baseline algorithms. We would like to thank Yunhua Hu for providing the extracted titles of the "Gov" corpus and thank Guomao Xin, Shuming Shi, Ruihua Song, Zhicheng Dou and Jirong Wen for their help on corpus processing and indexing. We would also like to thank Lan Nie, Brian D. Davison, and Xiaoguang Qi for providing the web page classification models for the feature extraction of the "Gov" corpus.
References
1. A. Asuncion and D. Newman. UCI machine learning repository, 2007.
2. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
3. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, 1998.
4. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 89–96, New York, NY, USA, 2005. ACM Press.
5. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA, 2006. ACM Press.
6. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 129–136, New York, NY, USA, 2007. ACM Press.
7. G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller. Max-margin classification of data with absent features. J. Mach. Learn. Res., 9:1–21, 2008.
8. P. Chirita, J. Diederich, and W. Nejdl. MailRank: using ranking for spam detection. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 373–380, New York, NY, USA, 2005. ACM.
9. M. Collins. Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 07–12, 2002.
10. N. Craswell and D. Hawking. Overview of the TREC-2004 web track. In Proceedings of TREC 2004, 2004.
11. N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 web track. In Proceedings of TREC 2003, 2003.
12. K. Dave, S. Lawrence, and D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web, pages 519–528, New York, NY, USA, 2003. ACM Press.
13. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933–969, 2003.
14. X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and H.-Y. Shum. Query dependent ranking using k-nearest neighbor. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 115–122, New York, NY, USA, 2008. ACM.
15. Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the Thirtieth international conference on Very Large Data Bases, pages 576–587. VLDB Endowment, 2004.
16. E. F. Harrington. Online ranking/collaborative filtering using the perceptron algorithm. In Proceedings of the 20th International Conference on Machine Learning, pages 250–257, 2003.
17. R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In ICANN 1999, pages 97–102, 1999.
18. W. Hersh, C. Buckley, T. J. Leone, and D. Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In SIGIR '94, pages 192–201, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
19. Y. Hu, G. Xin, R. Song, G. Hu, S. Shi, Y. Cao, and H. Li. Title extraction from bodies of HTML documents and its application to web page retrieval. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 250–257, New York, NY, USA, 2005. ACM.
20. J. C. Huang and B. J. Frey. Structured ranking learning using cumulative distribution networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 697–704, 2009.
21. K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
22. T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142, New York, NY, USA, 2002. ACM Press.
23. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
24. L. Li and H.-T. Lin. Ordinal regression by extended binary classification. In NIPS, pages 865–872, 2006.
25. P. Li, C. Burges, and Q. Wu. McRank: learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897–904. MIT Press, Cambridge, MA, 2008.
26. I. Matveeva, C. Burges, T. Burkard, A. Laucius, and L. Wong. High accuracy retrieval with multiple nested ranker. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 437–444, New York, NY, USA, 2006. ACM.
27. T. Minka and S. Robertson. Selection bias in the LETOR datasets. In Proceedings of SIGIR 2008 Workshop on Learning to Rank for Information Retrieval, 2008.
28. L. Nie, B. D. Davison, and X. Qi. Topical link analysis for web search. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 91–98, New York, NY, USA, 2006. ACM.
29. B. Pang and L. Lee. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124, 2005.
30. T. Qin, T.-Y. Liu, J. Xu, and H. Li. How to make LETOR more useful and reliable. In Proceedings of SIGIR 2008 Workshop on Learning to Rank for Information Retrieval, 2008.
31. T. Qin, T.-Y. Liu, and H. Li. A general approximation framework for direct optimization of information retrieval measures. Technical Report MSR-TR-2008-164, Microsoft Corporation, 2008.
32. T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen, and W.-Y. Ma. A study of relevance propagation for web search. In SIGIR '05, pages 408–415, New York, NY, USA, 2005. ACM Press.
33. T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, and H. Li. Global ranking using continuous conditional random fields. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS. MIT Press, 2008.
34. T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, W.-Y. Xiong, and H. Li. Learning to rank relational objects and its application to web search. In WWW '08: Proceedings of the 17th international conference on World Wide Web, pages 407–416, New York, NY, USA, 2008. ACM.
35. T. Qin, X.-D. Zhang, M.-F. Tsai, D.-S. Wang, T.-Y. Liu, and H. Li. Query-level loss functions for information retrieval. Information Processing & Management, 44(2):838–855, 2008.
36. T. Qin, X.-D. Zhang, D.-S. Wang, T.-Y. Liu, W. Lai, and H. Li. Ranking with multiple hyperplanes. In SIGIR '07, pages 279–286, New York, NY, USA, 2007. ACM Press.
37. S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 42–49, New York, NY, USA, 2004. ACM.
38. S. E. Robertson and D. A. Hull. The TREC-9 filtering track final report. In TREC, pages 25–40, 2000.
39. A. Shakery and C. Zhai. Relevance propagation for topic distillation: UIUC TREC-2003 web track experiments. In Proceedings of TREC, pages 673–677, 2003.
40. M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In WSDM '08: Proceedings of the international conference on Web search and web data mining, pages 77–86, New York, NY, USA, 2008. ACM.
41. TREC 2004 web track guideline. http://research.microsoft.com/en-us/um/people/nickcr/guidelines_2004.html.
42. M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. FRank: a ranking method with fidelity loss. In SIGIR '07, pages 383–390, New York, NY, USA, 2007. ACM Press.
43. M. N. Volkovs and R. S. Zemel. BoltzRank: learning to maximize expected ranking gain. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 1089–1096, New York, NY, USA, 2009. ACM.
44. E. Voorhees and D. Harman. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
45. F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: theory and algorithm. In ICML '08: Proceedings of the 25th international conference on Machine learning, New York, NY, USA, 2008. ACM Press.
46. J. Xu, Y. Cao, H. Li, and M. Zhao. Ranking definitions with supervised learning methods. In International World Wide Web Conference, pages 811–819, New York, NY, USA, 2005. ACM Press.
47. J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 391–398, New York, NY, USA, 2007. ACM Press.
48. J. Xu, T.-Y. Liu, M. Lu, H. Li, and W.-Y. Ma. Directly optimizing evaluation measures in learning to rank. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 107–114, New York, NY, USA, 2008. ACM.
49. G.-R. Xue, Q. Yang, H.-J. Zeng, Y. Yu, and Z. Chen. Exploiting the hierarchical structure for link analysis. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA, 2005. ACM.
50. Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 271–278, New York, NY, USA, 2007. ACM Press.
51. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 334–342, New York, NY, USA, 2001. ACM Press.
52. C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 10–17, New York, NY, USA, 2003. ACM Press.