SIR’11: Information Retrieval Over Query Sessions


Workshop of the 33rd Annual European BCS-IRSG Conference on Information Retrieval

Program Committee


Van Dang, University of Massachusetts, Amherst
Henry Feild, University of Massachusetts, Amherst
Ayse Goker, City University London
Jimmy Huang, York University
Jim Jansen, Penn State
Kalervo Järvelin, University of Tampere
Rosie Jones, Akamai
Ben Piwowarski, University of Glasgow
Maarten de Rijke, University of Amsterdam
Ian Ruthven, University of Strathclyde
Mark Smucker, University of Waterloo
Emine Yilmaz, Microsoft Research, Cambridge

Organizers

Ben Carterette, University of Delaware
Evangelos Kanoulas, University of Sheffield
Paul Clough, University of Sheffield
Mark Sanderson, RMIT University


Invited Talk
Applications of Web Search Session Analysis -- Rosie Jones

Discussion Panel
Information Retrieval Evaluation over Query Sessions -- Kalervo Järvelin, Stephen Robertson, Tetsuya Sakai, Mark Sanderson

Accepted Papers
1. The Use of Domain Modelling to Improve Performance Over a Query Session -- Deirdre Lungley, M-Dyaa Albakour and Udo Kruschwitz
2. Implicit relevance feedback from a multi-step search process: a use of query-logs -- Corrado Boscarino, Arjen De Vries, Vera Hollink and Jacco Van Ossenbruggen
3. Same Query -- Different Results? A Study of Repeat Queries in Search Sessions -- Johannes Leveling and Gareth Jones
4. New user profile learning for extremely sparse data sets -- Tomasz Hoffmann, Tadeusz Janasiewicz and Andrzej Szwabe
5. An Exploration of Query Term Deletion -- Hao Wu and Hui Fang
6. Query Session Data vs. Clickthrough Data as Query Suggestion Resources -- Makoto P. Kato, Tetsuya Sakai and Katsumi Tanaka
7. Query Session Detection as a Cascade -- Matthias Hagen, Benno Stein and Tino Rüb
8. Automatic Generation of Query Sessions using Text Segmentation -- Debasis Ganguly, Johannes Leveling and Gareth Jones
9. Crowdsourcing Interactions: Capturing query sessions through crowdsourcing -- Guido Zuccon, Teerapong Leelanupab, Stewart Whiting, Emine Yilmaz, Joemon Jose and Leif Azzopardi

Workshop Proceedings


Preface

Research in Information Retrieval has traditionally focused on serving the best results for a single query. Real users, however, often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate before they find either the thing or everything they are looking for. Early studies on web search query logs showed that half of all Web users reformulated their initial query: 52% of the users in the 1997 Excite data set, 45% of the users in the 2001 Excite data set. The focus of this Workshop is the advances in Information Retrieval technology over query sessions. The Workshop is structured along two main themes: (A) algorithms that use information provided throughout query sessions for a number of tasks, ranging from ranking to query suggestion and construction of user profiles, and (B) query session detection and construction of query session collections.

April 2011

Ben Carterette
Evangelos Kanoulas
Paul Clough
Mark Sanderson

SIR’11

Program Committee

Van Dang, University of Massachusetts, Amherst
Henry Feild, University of Massachusetts, Amherst
Ayse Goker, City University London
Jimmy Huang, York University
Jim Jansen, Penn State
Kalervo Järvelin, University of Tampere
Rosie Jones, Akamai
Ben Piwowarski, University of Glasgow
Maarten de Rijke, University of Amsterdam
Ian Ruthven, University of Strathclyde
Mark Smucker, University of Waterloo
Emine Yilmaz, Microsoft Research, Cambridge


Invited Talk

Applications of Web Search Session Analysis
Rosie Jones

Akamai, Boston, USA

Biography

Rosie Jones is Director of Computational Advertising Research at Akamai Technologies. She has worked on research in web search session analysis for nine years, making major research and product contributions over an industry research career including senior research scientist positions at Overture Technologies and Yahoo! Labs. Her work in this area spans applications in sponsored search advertising, learning relevance from clicks, geographic search and privacy, as well as detecting user satisfaction and user frustration. She holds 5 issued patents in applications of web search session analysis, and over 20 publications in this area. Her research interests include computational advertising, web search, geographic information retrieval, and natural language processing. She received her PhD from the School of Computer Science at Carnegie Mellon University under the supervision of Tom Mitchell, where her doctoral thesis was titled Learning to Extract Entities from Labeled and Unlabeled Text. She co-organized the WSDM 2009 Workshop on Web Search Click Data (WSCD09) and has given tutorials at SIGIR and WWW. She has served on the Senior PC for SIGIR between 2007 and 2011, and is a Senior Member of the ACM.


Discussion Panel

Information Retrieval Evaluation over Query Sessions
Kalervo Järvelin, Stephen Robertson, Tetsuya Sakai, Mark Sanderson

Biographies

Kalervo Järvelin is Professor and Vice Chair at the School of Information Sciences, University of Tampere, Finland. He holds a PhD in Information Studies (1987) from the same university. He was Academy Professor, Academy of Finland, in 2004-2009. Järvelin’s research covers information seeking and retrieval, linguistic and conceptual methods in IR, IR evaluation, and database management. He has coauthored over 250 scholarly publications and supervised sixteen doctoral dissertations. His H-index is 25 in Google Scholar and 12 in Web of Science (January 2011). He has been a principal investigator of several research projects funded by the EU, industry, and the Academy of Finland. Järvelin has frequently served the ACM SIGIR Conferences as a program committee member (1992-2009), Conference Chair (2002) and Program Co-Chair (2004, 2006); and the ECIR, the ACM CIKM and many other conferences as PC member. He is an Associate Editor of Information Processing and Management (USA). Järvelin received the Finnish Computer Science Dissertation Award 1986; the ACM SIGIR 2000 Best Paper Award for the seminal paper on the discounted cumulated gain evaluation metric; the ECIR 2008 Best Paper Award for session-based IR evaluation; the IIiX 2010 Best Paper Award for a study on task-based information access; and the Tony Kent Strix Award 2008 in recognition of contributions to the field of information retrieval.

Stephen Robertson is a senior researcher at Microsoft Research, Cambridge. In 1998, he was awarded the Tony Kent STRIX award by the Institute of Information Scientists. In 2000, he was awarded the Salton Award by ACM SIGIR. He is a Fellow of Girton College, Cambridge. At Microsoft, he runs a group called Information Retrieval and Analysis, which is concerned with core search processes such as term weighting, document scoring and ranking algorithms, combination of evidence from different sources, and with metrics and methods for evaluation and for optimisation. His main research interests are in the design and evaluation of retrieval systems. He is the author, jointly with Karen Spärck Jones, of a probabilistic theory of information retrieval, which has been moderately influential. A further development of that model, with Stephen Walker, led to the term weighting and document ranking function known as Okapi BM25, which is used in many experimental text retrieval systems. Prior to joining Microsoft, he was at City University London, where he retains a part-time position as Professor of Information Systems in the Department of Information Science. He was Head of Department for eight years, during which time it achieved the highest possible rating in two successive research assessment exercises. He also started the Centre for Interactive Systems Research, the main research vehicle of which is the Okapi text retrieval system, which has done well at TREC.

Tetsuya Sakai received a Master’s degree from Waseda University in 1993 and joined Toshiba in the same year. He received a Ph.D from Waseda in 2000 for his work on information retrieval and filtering systems. From 2000 to 2001, he was a visiting researcher at the University of Cambridge Computer Laboratory. In 2007, he left Toshiba and became Director of the Natural Language Processing Laboratory at NewsWatch, Inc. In 2009, he joined Microsoft Research Asia. He is an Evaluation Co-chair of NTCIR and is currently co-organising the NTCIR-9 1CLICK and INTENT tasks. He has served as a Senior PC member for ACM SIGIR, CIKM and AIRS. He is on the editorial board of Information Processing and Management and that of the Information Retrieval journal. He has received several awards in Japan, mostly from IPSJ.

Mark Sanderson is a Professor at RMIT University in Melbourne where he is the head of the Information Retrieval Analysis and Retrieval group. He is particularly interested in evaluation of search engines, but also works in geographic search, cross language IR (CLIR), summarisation, image retrieval by captions, and word sense ambiguity. He has published over many years on the topic of evaluation and on tracking user behaviour in search logs. With others, he initiated the imageCLEF, geoCLEF and TREC session tracks.


The Use of Domain Modelling to Improve Performance Over a Query Session

Deirdre Lungley, M-Dyaa Albakour, and Udo Kruschwitz

School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester, U.K.

Abstract. This paper addresses the question of how to improve search results by incorporating previous interactions with the search engine within the same session. It explores the usefulness of Formal Concept Analysis to derive knowledge structures that represent information needs of users within a Web search session. Using TREC 2010 session track data as an evaluation platform we discuss the improvements that can be achieved in the retrieval performance over a user session.

Keywords: Information Retrieval, Domain Modelling, Formal Concept Analysis, Query Expansion, TREC Session Track

1 Introduction

The inclusion of a Session Track in TREC 2010 was a recognition of the importance of the session, over the individual query, in eliciting the true information need of the user. One of the goals of this track was to test whether systems could improve their performance for a given query by utilizing prior history. For pragmatic reasons this initial Session Track was limited to two-query sessions and further limited to three query types: specification, generalisation and drifting [6]. However, even with these constraints the session track resulted in a very valuable resource that can be used to explore the usefulness of various query modification approaches applied to the test collection.

In this paper we propose to utilise a knowledge structure, a Formal Concept Analysis (FCA) lattice, to represent the user information needs within a search session. This is done by extracting from the lattice phrases or terms with which to expand the session’s reformulated query. This work builds on our research group’s submission to the TREC 2010 Session Track [1] and on previous work with our FCA methodology [7] to produce a browsable concept lattice that has proven to be useful for building interactive search interfaces.

In this paper we describe how these lattice structures can be built and used to solve the tasks introduced in the session track of TREC 2010. We report promising results with reference to results reported by TREC participants [6].

2 Related Work

The Session Track Overview document briefly describes the methodologies participants employed in their submitted runs [6]. An analysis of these summaries and the related results achieved illustrates the state of the art in search over session queries. Methods used included term weighting, pseudo-relevance feedback, ranked list re-ranking and merging, and query expansion. The latter features prominently in the more successful submissions, including that from our group, and therefore this paper again avails of it. Resources which participants used to help in the task of deriving expansion terms included: query logs, anchor logs, Open Directory Project (ODP) categories and WordNet.

Our group’s submission used the anchor text for the ClueWeb09 category B dataset made publicly available by the University of Twente¹. Using the anchor log, the top common associated queries of both queries in the session were extracted using Fonseca’s association rules [3]. Then the reformulated query was expanded with the extracted phrases and terms and the original query, giving higher weights to the reformulated query. The results of this earlier experiment (our TREC run – essex3) have been included in this paper for comparison purposes.

The use of anchor logs in this way proved effective; however, they are not always available and can be computationally expensive to process. In the experiments reported here we therefore worked purely with data returned by the search engine: these documents were processed into a lattice using FCA and then used to extract query expansion terms.

3 Methodology

150 query pairs were provided by the organisers of the Session track: 52 specification reformulations, 50 drifting and 48 generalisation. This track requires the submission of three ranked lists: one over the initial query (RL1), one over the query reformulation, ignoring the initial query (RL2), and one over the query reformulation taking into consideration the initial query (RL3). Track task G1 uses ranked lists RL2 and RL3 to evaluate the ability of systems to utilise prior history. Track task G2 uses ranked lists RL1 and RL3 to evaluate the quality of the ranking function over the entire session.

For our experiments with these query pairs we adopted a similar approach to what we used in our TREC submission [1] in that we were reformulating the query based on the previous query and concepts extracted automatically which are linked to both queries. We used the existing Indri index of the ClueWeb09 dataset which is searchable via a public web service². Ranked lists RL1 and RL2 are obtained by simply submitting the original query and the reformulated query respectively to the Indri search engine. The language model based Indri search engine supports query expansion via its weighted belief operators. These allow us to assign varying weights to expressions and so formulate our query with which to obtain RL3.

¹ http://wwwhome.cs.utwente.nl/hiemstra/2010/anchor-text-for-clueweb09-category-a.html
² http://boston.lti.cs.cmu.edu:8085/clueweb09/search/catb/lemur.cgi


Our query reformulation used to generate RL3 involved adding concepts generated from FCA lattices. For all query pairs in this experiment, both the initial query and the reformulated query were used to generate search engine calls. The documents returned by each call provided the input to an FCA lattice algorithm, generating two lattice concept structures. We embarked on our experimentation with two main hypotheses. Firstly, that lattice concepts generated from documents returned by a search engine over ClueWeb09 would be more discriminating than those generated by documents returned by a WWW search API. Secondly, that using a combination of common lattice concepts and distinct lattice concepts depending on the distribution of terms between the query pairs would be more discriminating than using solely distinct concepts.

We selected the top three concepts with which to expand the reformulated query. By top three concepts, we mean those, in FCA terminology [4], with the largest extent – those contained within the largest number of documents. Using Indri’s belief operators we assigned a 0.7 weight to the reformulated query and a combined 0.3 weight to the original query and the expansion terms. These are the same weights used in previous Indri experiments and also adopted in our TREC submission (essex3). The following is an example of one of our reformulated queries for the query pair “hoboken:hoboken nightlife”:

#weight(0.7 #combine(hoboken nightlife)
        0.3 #combine(hoboken #1(hoboken bars) #1(hoboken wine bar) #1(hoboken gay)))
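Such a query string can be assembled mechanically from the selected concepts. The following is a minimal Python sketch: the 0.7/0.3 weights follow the description above, while the helper name build_rl3_query and its argument layout are our own illustration rather than the authors' code.

def build_rl3_query(original_query, reformulated_query, concepts,
                    w_reform=0.7, w_context=0.3):
    """Assemble an Indri #weight query: the reformulated query gets weight 0.7,
    while the original query plus the top FCA concepts (as #1 exact phrases)
    share the remaining 0.3."""
    phrases = " ".join("#1({})".format(c) for c in concepts)
    context = "#combine({} {})".format(original_query, phrases)
    return "#weight({} #combine({}) {} {})".format(
        w_reform, reformulated_query, w_context, context)

# Reproduces the example above for the pair "hoboken" -> "hoboken nightlife".
print(build_rl3_query("hoboken", "hoboken nightlife",
                      ["hoboken bars", "hoboken wine bar", "hoboken gay"]))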

The Indri search engine returned 1000 documents and we applied the Waterloo Spam rankings³ available for the ClueWeb09 dataset, removing documents with spam scores of 70% or less, as recommended by the rankings’ creators [2].

³ http://durum0.uwaterloo.ca/clueweb09spam/
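As a small illustration of this filtering step, the sketch below assumes the rankings are available as a mapping from document id to spam percentile; the helper and its names are ours, not the authors' code.

def filter_spam(ranked_doc_ids, spam_percentile, threshold=70):
    """Keep only documents whose Waterloo spam percentile is above 70,
    i.e. drop documents with spam scores of 70% or less as described above."""
    return [d for d in ranked_doc_ids if spam_percentile.get(d, 0) > threshold]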

We conducted a number of initial runs and report here on our optimum run, one which used Microsoft’s Live Search API to generate the query lattices and used distinct concepts. By distinct concepts we mean those generated by the second query and not by the first, except in the case of clear generalisation queries (the first query subsumed the second). Here we used the distinct concepts generated by the first query.

4 Evaluation

For these experiments we employ the same baseline (essex1) as used in our TREC submission. This baseline represents the simplest way of using previous user interaction, i.e. submitting a query formed by the union of all terms in both the original query and the reformulated query to generate ranked list RL3. We include in our results, for comparison purposes, those results our group achieved in the actual TREC Session track (essex3).

The evaluation scripts provided by the track organisers allow us to report using variants on Järvelin et al.’s [5] normalised session DCG (nsDCG) and on the standard nDCG. Using nsDCG allows us to incorporate a cost for reformulating a query. The particular instantiation of these metrics is detailed in the Session Track Overview paper [6]. We report these evaluation metrics in a number of ways. Table 1 details the percentage average improvement we achieved from [email protected] to [email protected]. This shows an improvement over the baseline method for our lattice methodology. We also include two tables similar to those provided in the Session Track Overview paper to allow comparison with similar work. Table 2 details the nsDCG@10 scores of the session RL1 → RL3 ([email protected]). This indicates how the various systems performed over the entire session.⁴ Table 3 details the nsDCG@10 scores for the RL1 → RL2 and RL1 → RL3 sessions, [email protected] and [email protected] respectively, and the nDCG@10 scores for each ranked list. This indicates the ability of the retrieval systems to utilise past user queries to improve the results over the current query.
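For reference, the per-list nDCG@10 underlying these session metrics follows the standard definition below; this is only the common building block (a sketch of the usual exponential-gain formulation), and the exact gain settings and the per-query session discount added by nsDCG are those specified in the track overview [6].

\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}

where rel_i is the graded relevance of the document at rank i and IDCG@10 is the DCG@10 of the ideal ranking.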

System        All topics    Spec.    Gen.    Drift.
Essex1             7.32    -13.61   12.85    23.32
Essex3            19.83      0.20   14.37    44.36
MSN Lattice       17.35      0.99    9.48    40.74

Table 1. % average increase from [email protected] to [email protected].

[email protected] all sessions Specification Generalisation Drifting

essex1 0.2233 0.1456 0.2538 0.2738essex3 0.2249 0.1481 0.2531 0.2763Lattice 0.2193 0.1461 0.2455 0.2690CengageS10R1 0.2377 0.1827 0.2606 0.2723

Table 2. Results showing performance over the entire session, RL1 → RL3

The TREC Session Track results demonstrated that it was very hard to get any measurable improvement at all when utilizing the search history. Successful utilisation of the history of user requests to increase performance regarding the current request was achieved by 11 out of the 27 runs submitted by all participants to this track, focusing on the nsDCG@10 RL12 → nsDCG@10 RL13 comparison. However, only one submission was statistically significant. As illustrated in these tables, our lattice methodology achieved a measurable but not statistically significant improvement in performance (denoted by ↑).

⁴ CengageS10R1 is the best performing system in TREC 2010.


Run       nsDCG@10 RL12 → RL13   nDCG@10 RL1   nDCG@10 RL2 → RL3
essex1        0.2154 ↑ 0.2234        0.2077      0.2215 ↑ 0.2353
essex3        0.2154 ↑ 0.2249        0.2077      0.2215 ↑ 0.2461
Lattice       0.2154 ↑ 0.2193        0.2077      0.2215 ↑ 0.2373

Table 3. Results showing ability of system to utilise prior history.

5 Discussion

The difficulty of improving results for a second query given only the first is highlighted in the Session Track Overview document. Given the fact that few groups managed to report any positive results for the task [6], the improvements achieved with our lattice methodology make it a promising step towards utilising prior search history in user search sessions. An interesting observation we made while running experiments was that query pairs with fewer expansion terms often performed better, and a late experiment proved that the expansion term “wikipedia” was particularly beneficial. In fact, expanding every query pair with this term gave us a statistically significant improvement over the session RL1 → RL3 nsDCG@10 (0.2154 to 0.2301). Future experiments must strive to improve on this new baseline result.

6 Acknowledgements

The main author would like to acknowledge EPSRC doctoral funding, and this research is also part of the AutoAdapt research project. AutoAdapt is funded by EPSRC grants EP/F035357/1 and EP/F035705/1.

References

1. Albakour, M.D., Kruschwitz, U., Niu, J., Fasli, M.: AutoAdapt at the Session Track in TREC 2010. In Proceedings of TREC 2010, NIST, 2011. To appear.

2. Cormack, G., Smucker, M., Clarke, C.: Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets. In CoRR, 2010.

3. Fonseca, B.M., Golgher, P.B., de Moura, E.S., Ziviani, N.: Using association rules to discover search engines related queries. In Proceedings of the First Latin American Web Congress, pages 66–71, 2003.

4. Ganter, B., Wille, R.: Formal Concept Analysis – Mathematical Foundations. Springer, 1999.

5. Järvelin, K. and Kekäläinen, J.: IR evaluation methods for retrieving highly relevant documents. In Proceedings of SIGIR, pages 41–48, New York, NY, USA, 2000.

6. Kanoulas, E., Carterette, B., Clough, P., Sanderson, M.: Session Track 2010 Overview. In Proceedings of TREC 2010, NIST, 2011. To appear.

7. Lungley, D., Kruschwitz, U.: Automatically Maintained Domain Knowledge: Initial Findings. In Proceedings of ECIR, Toulouse, 2009.


Implicit relevance feedback from a multi-step search process: a use of query-logs

Corrado Boscarino, Arjen P. de Vries, Vera Hollink, and Jacco van Ossenbruggen

Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands

[email protected], [email protected], [email protected], [email protected]

Abstract. We evaluate the use of clickthrough information as implicit relevance feedback in sessions. We employ records of user interactions with a commercial news picture portal: issued queries, clicked images, and purchased content. Our study investigates how much of a session’s search history (if any) should be used in a feedback loop. We assess the benefit of using clicked data as positive tokens of relevance to the task of estimating the probability of an image to be purchased. We find that a short history of past queries helps improve ranking, and that terms derived from clicked documents lead to a much higher effectiveness, while blind relevance feedback is not beneficial for the task.

1 Evidence of user interaction: Query Logs (QL)

Logs of queries issued and the subsequent interactions with the query results, briefly referred to as ‘query logs’ (QLs) in this paper, provide a basis to adapt a relevance model to reflect what we have learned about the user’s information need. A set of QLs recorded when subscribers to Belga Picture¹ were searching for images to be purchased online allows us 1) to investigate how valuable clicks are as a source of (implicit) relevance feedback in a multi-step search session and 2) to observe how much search history (if any) may lead to an improvement in the ranking of what we believe to be a determinately relevant document: the picture that a user is known to have purchased at the end of a search session.

A QL registers, for each session, three types of user interactions: query submissions (Q), a possibly empty set of clicks (C) on the retrieved results, and purchases (P); an anonymous identifier labels each step. Previous studies diverge in their findings about how much evidence of user interactions (Q and C) should be used for feedback: Tan et al. report in [5] that long term search history may improve web retrieval, while the authors of [4] argue to emphasize short-term query context. Also, Gong et al. question whether clicked data should be accepted as positive evidence of a document being relevant without a quality metric [2], while Joachims et al. report on user studies where the quality of implicit feedback from clicks can compete with explicit judgements [3], especially when the additional burden on users in providing explicit feedback is taken into account.

¹ A European news agency: http://picture.belga.be/picture-home/index.html, log data collected within the VITALAS project: http://vitalas.ercim.org.

Fig. 1. Descriptive parameters of a search session: length L = 8, current observation at step l = 6, gap to a purchase m = 1 and history window W = 1, 2, 3, max.

In the pilot study described here, we consider a scenario where the retrieval system is expected to take advantage of the recorded query session, in an attempt to rank the user’s purchases on top, early on in the session. This task is not trivial, as in any practical setting the system does not know the total length L of the session. We define the observation gap m as the number of steps between the current interaction, observation step l, and the actual purchase P. Like session length L, m is not observable at step l. The open parameter that the system can choose is the size of the query history window W. Fig. 1 summarizes our view on query logs, and the notation used in the paper.

2 Adding clicked documents as additional query terms

In [1], Balog et al. describe a series of experiments that compare the effectiveness of various language modelling approaches that exploit query expansion from explicit user feedback. The original formulation in [1] explores different assumptions about the cognitive process of selecting a set of relevant documents for feedback: they are taken to be grasped by a variation on a two-step generative process. When we apply this model to our scenario, viewing clicks as if they were explicit relevance assessments, the best performing setting according to the evaluation of [1] would first select a picture annotation from a set of clicks with probability P(d|C) and subsequently pick a term from that annotation with probability P(t|d). We follow [1] in not making any additional hypothesis about the dependence between queries and clicks, hence the click probability is uniform and the probability of finding a term t in the clicked annotation will be

P(t \mid C) = \sum_{d \in C} P(t \mid d) \cdot P(d \mid C) = \frac{\sum_{d \in C} P(t \mid d)}{|C|}. \qquad (1)

Term frequency #(·) in an unsmoothed maximum likelihood estimate is a crude measure of term importance P(t|d) within a document, yielding

P(t \mid d) = P_{ML}(t \mid d) = \frac{\#(t, d)}{\sum_{\tau \in d} \#(\tau, d)}. \qquad (2)


In this setting, using the top K terms with highest probability P(t|C) to formulate an expanded query simply corresponds with using the K most frequent terms from each clicked document, considering each click equally important. In our case, however, due to the relatively short picture annotations, after excluding stop words even for small K a large part of the expansion terms will be chosen among terms that appear just once in the document: we need an additional hypothesis to ‘break the ties’ should the most frequent terms be exhausted.

Qualitatively, we noticed how users formulate a query mostly based on previously examined documents. We therefore make the additional hypothesis that term importance also depends on the degree of surprise that a user experiences when reading the annotation, and discriminate between unique terms based on an entropy metric. We then expand a query Q into a new, expanded query with the K most frequent terms in clicked annotations, weighted by the value of their γ-encoding for the entropy of the term distribution; since γ-encoding is prefix free, none of the document’s terms will have exactly the same weighted frequency.
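As a rough illustration of the basic expansion step of Eqs. (1)-(2), the following is our own sketch: it pools P(t|C) over the whole click set and omits the entropy-based tie-breaking just described, whereas the evaluated Q+C runs take K terms from each clicked annotation.

from collections import Counter

def expansion_terms(clicked_annotations, stopwords, k=5):
    # P(t|C): average of the maximum-likelihood P(t|d) over the clicked annotations.
    p_t_c = Counter()
    for annotation in clicked_annotations:       # each annotation: a list of terms
        terms = [t for t in annotation if t not in stopwords]
        if not terms:
            continue
        counts = Counter(terms)
        for t, freq in counts.items():           # P(t|d) = #(t,d) / sum_tau #(tau,d)
            p_t_c[t] += (freq / len(terms)) / len(clicked_annotations)
    return [t for t, _ in p_t_c.most_common(k)]  # K terms with highest P(t|C)

# e.g. expansion_terms([["press", "conference", "brussels"],
#                       ["press", "agency", "photo"]], stopwords=set(), k=3)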

3 Evaluation

For our experiments on the Belga data, we have assumed session boundaries whenever the period of inactivity between two successive actions exceeded a 15-minute timeout. Queries are defined as the complete strings that are submitted in the search box, and split into query terms considering whitespace as delimiter. The Belga data contain 1003 sessions that consist of 3 to 13 steps before a purchase is observed, the subset that we use in our experiments. The distribution of sessions over session length L is given in Table 1.

Table 1. Distribution of sessions versus session length L.

Session Length   L=3   L=4   L=5   L=6   L=7   L=8   L=9   L=10   L=11   L=12   L=13
# Sessions       210   184   150   113   110    64    56     36     42     23     15
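A minimal sketch of the session segmentation described above, assuming each log action carries a timestamp; the record layout and helper name are our own illustration.

from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=15)   # inactivity gap that closes a session

def split_into_sessions(actions):
    # actions: a single user's log records, ordered by time, each with a .timestamp
    sessions, current, last_ts = [], [], None
    for action in actions:
        if last_ts is not None and action.timestamp - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(action)
        last_ts = action.timestamp
    if current:
        sessions.append(current)
    return sessions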

Using Lemur² out of the box, we retrieve 1000 images from the Belga collection and estimate the effectiveness of ranking the purchased pictures per session. As some sessions have recorded multiple purchases, we opted to report on Mean Average Precision (MAP), but the trends in the results are identical when reporting Mean Reciprocal Rank (MRR).

We compare three methods. The baseline method (Q) simply considers as a query the union of all the query terms in window W. The blind relevance feedback method (BRF) expands the query constructed as in the baseline method with the top 3 expansion terms from the top 5 ranked documents. The final method (Q+C) applies the query expansion approach described in the previous section, expanding the query produced in the baseline method with the K = 5 expansion terms derived from each of the clicked documents.

² http://www.lemurproject.org/


We simulate a system that operates under real conditions: i.e., with unknown session length L (and thus unknown observation gap m), attempting to improve the ranking of the purchased images. We vary window length (W ∈ {1, 2, 3, l}), and compare the success of ranking the purchases with the three methods described.

Our evaluation aims to single out the effect of the unobservable m while assessing the dependence on the history window parameter W. We first aggregate the performance over all observation gaps m (results summarized in Table 2).

Table 2. Average MAP over m: the MAP at W = 1, m = 0 is 0.0578.

MAP      Q only    Q + C     BRF
W = 1    0.0114    0.0448    0.0106
W = 2    0.0144    0.0526    0.0127
W = 3    0.0144    0.0573    0.0134
W = l    0.0114    0.0483    0.0104

Rank bias turns out hard to overcome in these experiments based on query logs: the MAP of 0.0578 obtained by a system not using any feedback (W = 1, m = 0) is superior to all other settings. If we however look into the settings where we ‘look ahead’ (the observation gap m > 0), then we conclude that a short-term context gives the best performance.

Next, we investigate the performance of the system as a function of the distance to a purchase, considering a fixed observation gap and averaging only over different session lengths. The top part of Fig. 2 shows that the baseline effectiveness (using past queries only) mainly depends on the distance to P, irrespective of the amount of history taken into account. Blind relevance feedback leads to slightly worse results. The bottom part of Fig. 2 demonstrates how click history could be a useful source of positive relevance feedback: more click history ranks the true purchases higher, with results for W = l more than twice as effective as the setting ignoring session history (W = 1). We also see that using clicks reduces the dependence of effectiveness obtained upon the unobservable m (unfortunately, with the exception of the previously mentioned case of m = 0, which we attribute to rank bias).

Fig. 2. MAP for W = 1, 2, 3, l with (bottom; Q+C) and without (top; Q) additional terms from the clicked data; (+) in the top figure plots BRF results.

4 Conclusions and future work

We have investigated the effectiveness of using QL data to improve the retrieval performance, and the quality of clicked data as a source of implicit relevance feedback. The preliminary results obtained show that information about previous user interactions with the system may help improve overall performance. Predicting a purchase early in the session remains difficult, but taking a moderate amount of session information into account seems beneficial. Our conclusions are preliminary, as they are based on relatively straightforward methods: we have not weighted query terms, and did not tune all parameters of the retrieval model to the collection. Apart from exploring better methods for query representation, we would like to relax the assumption that a session relates in its entirety to a single, static information need, and investigate in more depth the relation between the documents visited and the subsequent user queries issued.

References

1. K. Balog, W. Weerkamp, and M. de Rijke, “A few examples go a long way: constructing query models from elaborate query formulations,” in SIGIR, 2008.

2. B. Gong, B. Peng, and X. Li, “A personalized re-ranking algorithm based on relevance feedback,” in Advances in Web and Network Technologies, and Information Management, 2007, vol. 4537, pp. 255–263.

3. T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, “Accurately interpreting clickthrough data as implicit feedback,” in SIGIR, 2005.

4. G. Pandey and J. Luxenburger, “Exploiting session context for information retrieval - a comparative study,” in Advances in Information Retrieval. Springer Berlin / Heidelberg, 2008, vol. 4956, pp. 652–657.

5. B. Tan, X. Shen, and C. Zhai, “Mining long-term search history to improve search accuracy,” in Proceedings of SIGKDD, 2006, pp. 718–723.


Same Query – Different Results? A Study of Repeat Queries in Search Sessions

Johannes Leveling and Gareth J. F. Jones

Centre for Next Generation Localisation (CNGL), School of Computing, Dublin City University, Dublin 9, Ireland
{jleveling, gjones}@computing.dcu.ie

Abstract. Typically, three main query reformulation types in sessions are considered: generalization, specification, and drift. We show that given the full context of user interactions, repeat queries represent an important reformulation type which should also be addressed in session retrieval evaluation. We investigate different query reformulation patterns in logs from The European Library. Using an automatic classification for query reformulations, we found that the most frequent (and presumably the most important) reformulation pattern corresponds to repeat queries. We aim to find possible explanations for repeat queries in sessions and try to uncover implications for session retrieval evaluation.

1 Introduction

There are two different approaches to information retrieval (IR) evaluation. One is the traditional Cranfield paradigm, where IR systems are evaluated independent of any search context, i.e. the same query will yield the same, reproducible results, regardless of individual users or search context. The other approach originates from studies in human computer interaction, where evaluation is based on user interactions with a system, which poses the challenge to design these studies so that reproducible results can be obtained. There is growing interest in bridging the gap between these approaches by investigating query reformulations and evaluating IR in context (see, for example, the TREC session track [1]).

In this paper we show that repeat queries should be considered as an important query reformulation pattern in session retrieval evaluation. Typically, the main categories of query reformulations include generalization, specification, and drift [2, 1, 3]. To investigate the importance of different reformulation patterns, we perform an analysis of reformulation types in sessions. We base our analysis on user interaction logs from The European Library (TEL). He et al. [3] and Jansen et al. [4] both propose similar algorithms for automatic classification of query reformulations. We classify query reformulations following the latter algorithm, which involves examining common, added, deleted, and modified terms for two successive query formulations. To our surprise, the automatic classification of reformulations revealed that repeat queries are the most frequent reformulation pattern in sessions. Similar to the Excite logs [2], navigating multi-page result lists appears to generate repeat queries in the TEL log. However, even if these are discarded – as in our additional analysis – repeat queries still make up the most frequent reformulation pattern. This poses a major problem for session retrieval evaluation because, to the best of our knowledge, evaluation experiments and metrics do not explicitly address repeat queries. For example, the issue of repeat queries raises such questions as: should the same result list be returned to the user for two identical successive queries in the same session? We aim to find possible explanations for repeat queries in sessions and implications for session retrieval evaluation.

2 Related Work

Most publications on repeat queries are concerned with reformulations in web search and do not concern repeat queries within sessions. For example, Teevan et al. [5] found in an analysis of web log data that 40% of all queries are re-find queries. He et al. [3] observe that the second most frequent type of reformulation in sessions is repeat queries, but they do not provide an explanation. (Note that we do not consider the most frequent type in their study, browsing, as query reformulation.) Sanderson and Dumais [6] report that 80% of repeat queries in web search are navigational queries. They find that repetition of queries often occurs on the same day. The smallest time interval examined in their investigation (except for periodicity studies) is a single day.

Repeat queries may have different causes: repeat queries in different sessions typically correspond to a known-item search and occur often in web search [5]. Repeat queries in sessions are unlikely to represent known-item search, due to the different time frame (i.e. less than 30 minutes between actions), and are issued by the same user. Some log entries with repeated queries are generated automatically by a system when the user navigates the multi-page search results (e.g. Excite, see [2]) and does not actually enter a query, but clicks on a link.

Huang and Efthimiadis give an excellent overview of different query reformulation patterns in the literature [7]. In the session track at TREC 2010, the focus was on three major patterns of reformulation [1]: 1. generalization (where the original results were too narrow and the reformulation into a more general query can be achieved by deleting a term), 2. specification (where the original results are too broad and the reformulation into a more specific query can be achieved by adding a term), and 3. drifting (where the reformulation is another query aiming at a different aspect of the information need). So far, repeat queries have not been considered in the session track.

However, most of the related work does not focus on repeat queries in the same session, but on repeated queries in different sessions (e.g. in web search) and by different users [5, 6].¹ None provides possible explanations as to why these may occur within a session.

¹ Note that Spink et al. defined repeat queries as all multiple occurrences of the same query that represent requests for multi-page viewing [2].


Table 1. TEL 2009 action log statistics.

# actions (total):            1866330
# queries (total):              86981
# sessions (total):             20325
avg. # actions per session:     15.68
avg. # queries per session:      4.28

3 Session Analysis

TEL Data. The analysis described in this paper is based on the TEL 2009 action logs (queries and user interactions), which were employed at LogCLEF 2009², a task for analyzing query logs. Table 1 shows statistics on the TEL log data.

Session reconstruction. The TEL logs were preprocessed to reconstruct user sessions by grouping all actions with the same (unique) session ID together and sorting them by timestamp. We define a session as a consecutive sequence of all user interactions, presuming the start of a new session if the time between consecutive actions exceeds 30 minutes. The 30 minute time interval originates from the working definition of session reconstruction in previous editions of LogCLEF.

Thus, sessions comprise actions such as changing the interface languages, selecting the target collection, submitting a query, and viewing results. Sessions using the TEL advanced search³, sessions containing non-English queries, and sessions with less than two queries were filtered out because we did not want to over-emphasize the importance of context changes (since switches between the advanced and simple search are frequent) or language-specific aspects.

Analysis of reformulations. In the context of session retrieval, we only consider directly consecutive queries as potential repeat queries and disregard all other similar queries within a session as possible repeat queries. We applied the automatic classification approach described in Jansen et al. [4] to identify the most important query reformulation patterns in sessions. We made only minor modifications to the algorithm (see [4] for details), i.e. using the edit distance for single word queries to determine if they are related and applying the classification on normalized queries, after case folding, stopword removal, and stemming. These changes were made to include spelling corrections, case folding, and morphological normalizations as query reformulations. Synonyms and abbreviations have not been explicitly considered as reformulations due to the open domain of the TEL target collections. Table 2 shows results for the reformulation classification. The most surprising result is that repeat queries correspond to the most frequent query reformulation pattern in sessions (i.e. not changing the query). So far, repeat queries in session evaluations have generally been ignored, but they raise the question as to why users should repeat queries within sessions.

² http://www.uni-hildesheim.de/logclef/
³ TEL supports simple (keyword) search and advanced (structured) search, which aims at different metadata fields (e.g. author, title) and employs Boolean operators.
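A highly simplified sketch of such a term-overlap classification follows; this is our own reduction of the approach, and the actual algorithm in [4] additionally uses edit distance for single-word queries and operates on normalized queries as described above.

def classify_reformulation(prev_query, query):
    prev_terms, terms = set(prev_query.lower().split()), set(query.lower().split())
    if terms == prev_terms:
        return "repeat"                # identical query resubmitted
    if prev_terms and prev_terms < terms:
        return "specification"         # terms only added
    if terms and terms < prev_terms:
        return "generalization"        # terms only deleted
    if terms & prev_terms:
        return "reformulation"         # partial overlap, some terms changed
    return "content change"            # no terms in common

# classify_reformulation("mozart symphonies", "mozart symphonies")  -> "repeat"
# classify_reformulation("mozart", "mozart symphonies")             -> "specification"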


Table 2. Frequency of query reformulation patterns in sessions.

Reformulation pattern                      Frequency
# specifications:                               1410
# specifications+reformulation:                  584
# generalizations:                              1196
# generalizations+reformulation:                 590
# reformulations:                               2072
# content changes:                              4608
# repeat queries (with diff. context):          8942
# empty queries:                                 386
Σ                                              19788

# repeat queries (including navigation):       50963

Explaining repeat queries. One reason why repeat queries have been ignored in evaluation is the assumption that the same query should return the same results. However, if the full search context is considered, there are actually repeat queries for which different results are expected by the user, for example when a different target collection is selected or language-specific settings are changed.

There are several explanations why users repeat queries. Often, a part of the search context, which includes the viewed documents and the selected target language or collection, is changed. In web search, users typically do not change search options or choose different target collections. In general, repeat queries in web search mostly occur in different sessions, where submitting identical queries represents a known-item search [5], i.e. re-finding a result with a previous query.

One explanation for repeat queries in the same session is that the user’s state of mind changes after viewing result documents and he assumes that the system keeps track of his search findings and will return different results for the same query. At least some users might expect that results change for the same query, because they are unfamiliar with a search system and their mental model of the information need changes after viewing some results. For example, log analysis on the TEL logs has shown that users often modify queries regardless of query processing steps, for example by adding or deleting stopwords, which are removed by the TEL system [8]. If this mismatch between actual and expected system behavior exists – even for a small group of users – session retrieval evaluation metrics should take the search context into account (e.g. by employing user models). For example, users would expect to find the same (or very similar) results when they submit the same query on the next day to re-find previous results. When they submit the same query in the same session, they may be more interested in result diversity and expect different results. Current session evaluation does not reflect this and ignores search context.

One possible explanation for repeat queries is that users submit a query again simply by mistake. However, the number of repeat queries is much too high to account for all of them by mistakes. An explanation found in web search research is that users use previous queries as an “anchor point” to jump back to the first result page, before exploring other aspects of the topic. This explanation implies a subsequent reformulation of the query so that each repeat should be followed by at least one reformulation on another aspect of the topic. This reformulation “meta-pattern” was not observed in the TEL logs. However, many users might navigate results and browse the same page or intermediate results several times in a session.

4 Implications for Session Retrieval Evaluation

We think that to build reusable test collections for session IR, more context of queries in sessions has to be captured. The search context includes language settings, the selected target collection, viewed documents, and user interactions other than queries (e.g. changing search options). The TEL log data already captures this information and provides additional metadata for evaluation (e.g. timestamps).

We also think that realistic data is a necessary prerequisite for research on session retrieval. So far, most (to our best knowledge: all) session retrieval models ignore repeat queries in sessions, although they are the most frequent reformulation pattern in sessions.

As part of the future work, we intend to carry out user studies to create real logs and experiments on different log data, to see if the observed results are specific to the log data used in this paper or can be observed in web search or in other domains such as enterprise search.

Acknowledgements

This research is supported by the Science Foundation Ireland (grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (http://www.cngl.ie).

References

1. Kanoulas, E., Clough, P., Carterette, B., Sanderson, M.: Session track at TREC 2010. In: SIGIR 2010 Workshop on Simulation of Interaction (2010)

2. Spink, A., Wolfram, D., Jansen, B.J., Saracevic, T.: Searching the web: The public and their queries. JASIST 52(3) (2001) 226–234

3. He, D., Goker, A., Harper, D.J.: Combining evidence for automatic web session identification. Information Processing & Management 38(5) (2002) 727–742

4. Jansen, B.J., Booth, D., Spink, A.: Patterns of query reformulation during web searching. ASIS&T 60(7) (2009) 1358–1371

5. Teevan, J., Adar, E., Jones, R., Potts, M.A.S.: Information re-retrieval: Repeat queries in Yahoo’s logs. In: SIGIR 2007 (2007) 151–158

6. Sanderson, M., Dumais, S.T.: Examining repetition in user search behavior. In: ECIR 2007 (2007) 597–604

7. Huang, J., Efthimiadis, E.N.: Analyzing and evaluating query reformulation strategies in web search logs. In: CIKM 2009 (2009) 77–86

8. Ghorab, M.R., Leveling, J., Zhou, D., Jones, G.J.F., Wade, V.: Identifying common user behaviour in multilingual search logs. In: CLEF 2009 (2010) 518–525


New user profile learning for extremely sparse data sets

Tomasz Hoffmann, Tadeusz Janasiewicz, and Andrzej Szwabe

Institute of Control and Information Engineering, Poznan University of Technology, pl. Marii Curie-Skladowskiej 5, 60-965 Poznan, Poland

{tomasz.hoffmann,tadeusz.janasiewicz,andrzej.szwabe}@put.poznan.pl

http://www.put.poznan.pl

Abstract. We propose a new method of online user profile learning for recommender systems that deals effectively with extreme sparsity of behavioral data. The proposed method enhances the singular values rescaling method and uses a pair of vectors to represent both positive and neutral user preferences. A list of discarded elements is used in a simple implementation of negative relevance feedback. We experimentally show the negative impact of dimensionality reduction on the accuracy of recommendations based on extremely sparse data. We introduce a new method for recommendation quality evaluation that involves the measurement of F1 performed iteratively during a simulated session. The combined use of the singular value rescaling and the user profile representation based on two complementary vectors has been compared with the use of well-known recommendation methods, showing the superiority of our method in the online user profile updating scenario.

Keywords: Recommender systems, user profile learning, collaborative data sparsity, vector space model, cold-start problem, relevance feedback

1 Introduction

The main purpose of many recommender systems is to recommend items to users in the interactive web environment [6], [7]. Behavioral data sparsity makes the effective online interaction between users and a recommender system an especially challenging task [3]. To our knowledge, there are only a few algorithms for new user profile learning that are oriented towards dealing with extremely sparse data sets. As shown in [2], data sparsity is a severe limitation for the effectiveness of methods based on dimensionality reduction [6]. In the classical vector space model a user profile is represented by a vector that aggregates vectors of all items selected by the user [1], [6]. In that case no additional information about unselected items is used, i.e., only ’positive’ preferences are stored. Such an approach to user profile modeling has a significant impact on recommendation accuracy.

We assume that the purpose of personalized recommendation is to identify top-N products that are the most relevant to the user [8]. Following this assumption, in this paper we investigate a double vector representation of a user profile that takes into account the sparsity of the data set [3]. We compare the proposed method to a few widely-used methods, such as collaborative filtering, ratings prediction and popularity-based item recommendation. We propose to estimate item relevance as a dot product between a user vector and an item vector, weighted by means of a probability model. Finally, we evaluate the presented method by using the F1 measure [6].

2 Evaluation of iterative user profile updating methods

We propose a binary representation of the ratings [6]. Taking the perspective of the find-good-items task [3], we assume that what is important is not how much the user likes a given product, but the fact that she or he was interested in it. To our knowledge, there has been no research into the direct impact of the dimensionality reduction process on the recovered matrix. We propose applying concentration curves [9] to visualize the ratings distribution before and after dimensionality reduction.
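A possible way to compute such a curve is sketched below, under our own reading of the axes in Fig. 1 (the cumulative share of ratings accounted for by a growing share of users); the ordering of users is an assumption on our part.

import numpy as np

def concentration_curve(ratings_per_user):
    # ratings_per_user: number of (non-zero) ratings each user has in the matrix
    counts = np.sort(np.asarray(ratings_per_user, dtype=float))   # lightest users first
    cum_users = np.arange(1, counts.size + 1) / counts.size * 100
    cum_ratings = np.cumsum(counts) / counts.sum() * 100
    return cum_users, cum_ratings   # plot cum_ratings against cum_users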

We evaluate the quality of recommendations by performing an F1 measurement [6] after each user action. The parameter denoted as x determines the number of ratings in the training set [6]. The interaction with a new user is simulated by iterative ’shifting’ of the user’s ratings from the training set to the test set. Initially, the most popular items are recommended for all compared methods. Next, the user selects the first item and the system generates a recommendation list by performing the following steps: 1) items that were discarded by the user are added to DL (Discarded List – a list of discarded items), 2) the user profile is updated according to the evaluated method, and 3) a recommendation list is generated using the evaluated method.
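For reference, the per-step measurement can be computed as in the following sketch (standard precision/recall over the top-n list against the user's held-out items; the exact list length and shifting protocol follow the text above, and the helper name is ours).

def f1_at_n(recommended, relevant, n=10):
    top = list(recommended)[:n]
    hits = len(set(top) & set(relevant))
    if hits == 0:
        return 0.0
    precision, recall = hits / len(top), hits / len(relevant)
    return 2 * precision * recall / (precision + recall)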

3 Evaluated recommendation methods

We compare our approach to a few well-known recommendation methods. Firstly, we evaluate the most popular item method (MP), which, as shown in [2], can effectively cope with data-set sparsity. Secondly, we use collaborative filtering (SVD-CF) that is based on the vector space model [3], [6]. The method uses SVD (Singular Value Decomposition) to obtain users’ and items’ vectors. When applying this method, we use the first 20 dimensions (k = 20) to find latent correlations between users, and to identify the 30 nearest neighbors (kNN = 30). Moreover, we compare our approach to the rating prediction method (SVD-RP) [6] as well as to a variant with average values removed from the input matrix (SVD-RPav) [6].

The solution proposed in this paper is referred to as the complementary spaces method (CSM). The first step of the algorithm is to decompose the binary input matrix $A_{m \times n}$. As a result of this decomposition, three matrices $U$, $S$ and $V$ are obtained, where $U$ is a matrix containing the users' vectors $u_i$, $V$ is a matrix containing the items' vectors $v_i$, and $S$ is a diagonal matrix of the singular values of $A$, denoted as $\sigma_i$. Our approach is based on representing a user profile by means of two vectors containing the user's positive and neutral preferences. As shown in [10], an extension of the user profile representation may improve the recommendation quality. In the case of our method, the vectors representing a user profile are built as sums of the vectors of the rated and unrated item sets, respectively: $\vec{u}_{p+} = \sum_{i \in I_R} \vec{v}_i$ and $\vec{u}_{p-} = \sum_{i \in I_{NR}} \vec{v}_i$, where $\vec{v}_i$ denotes the $i$-th item vector, $I_R$ is the set of items rated by the user and $I_{NR}$ is the set of items unrated by the user. We propose using a simple probabilistic model, based on the one proposed in [8], in order to weight the 'importance' of each part of a user profile. Each dimension of the vector space corresponds to a probability value, proportional to the square of the respective singular value $\sigma_i$. For all vectors in the space, we compute the probability based on the following assumptions:

1) the probability distribution is defined as $\vec{d} = [d_i]$, where $d_i = \sigma_i^2 / \sum_{j \in I} \sigma_j^2$, $I = \{1, 2, \dots, \min(m, n)\}$, $i \in I$ and $\sum_{i \in I} d_i = 1$;

2) the probability value related to an item vector is equal to $P(\vec{v}_j) = \sum_{i \in I} v_{j,i}^2 \cdot d_i$, where $\sum_{i \in I} v_{j,i}^2 = 1$.

This model is based on the quantum probability framework proposed in [4]. It permits us to weight the parts of the user profile by appropriate probability values, determined by means of the singular value distribution. We implemented negative relevance feedback [5], which is based on the assumption that elements recommended by the system and discarded by the user are no longer useful during the session. All the discarded items are stored on a list denoted as DL. Our singular value rescaling method is based on the probabilistic interpretation of the vector coordinates. Firstly, the distribution $\vec{d}$ is prepared. Secondly, we compute a superposition of the squared vectors representing items selected or rated by the user, called the user square profile $\vec{u}_{sqp} = \sum_{i \in I_R} \vec{v}_i^2$. Next, the user square profile is used to scale $\vec{d}$ and to obtain a new distribution $\vec{d}_{new} = \mathrm{mul}(\vec{u}_{sqp}, \vec{d})$, where mul denotes element-wise multiplication. The relation between $\vec{d}_{new}$ and $\vec{d}$ is represented by a vector of coefficients (each corresponding to a particular dimension), denoted as $\vec{w}_{scale} = \mathrm{div}(\vec{d}_{new}, \vec{d})$, where div denotes coordinate-by-coordinate division; it is used to scale the coordinates of the item vectors from matrix $V$. Respectively, we compute $\vec{w}'_{scale} = \mathrm{div}(\vec{d}'_{new}, \vec{d})$, where $\vec{d}'_{new} = \mathrm{sub}(\vec{d}, \vec{d}_{new})$ and sub is a coordinate-wise subtraction. Next, these coefficients are used to scale the user profile vectors, $\vec{u}_{new+} = \mathrm{mul}(\vec{w}_{scale}, \vec{u}_{p+})$ and $\vec{u}_{new-} = \mathrm{mul}(\vec{w}'_{scale}, \vec{u}_{p-})$, and the item vectors, $V_{new} = V \cdot \mathrm{diag}(\vec{w}_{scale})$ and $V'_{new} = V \cdot \mathrm{diag}(\vec{w}'_{scale})$, where diag denotes the diagonal matrix in which the given vector forms the diagonal. According to the user profile representation, we obtain two lists denoted as $\vec{r}_1 = \mathrm{sqr}(\vec{u}_{new+} \cdot V_{new})$ and $\vec{r}_2 = \mathrm{sqr}(\vec{u}_{new-} \cdot V'_{new})$. Next, we obtain two probabilities, $p_1 = \mathrm{mul}(\vec{u}_{sqp}, \vec{d})$ and $p_2 = 1 - p_1$, for the two profile vectors. These probabilities are used as weights for the similarity vectors $\vec{r}_1$ and $\vec{r}_2$. Thus, the final form of the similarity vector is $\vec{r} = p_1 \cdot \vec{r}_1 + p_2 \cdot \vec{r}_2$. As a result of our algorithm, the system is able to recommend items from both the positive and the neutral list, applying an appropriate proportional weighting.
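The following NumPy sketch illustrates one possible reading of the CSM scoring described above; the function name, the scalar interpretation of p1, and the orientation of the vector-matrix products are our assumptions, not taken from the paper:

```python
import numpy as np

def csm_scores(A, rated):
    """Score all items for a user who has selected the items in `rated`.

    A     : binary user-item matrix (m x n)
    rated : indices of the items selected so far (the set I_R)
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt.T                                   # item vectors as rows (n x k)
    d = s**2 / np.sum(s**2)                    # probability distribution over dimensions

    unrated = np.setdiff1d(np.arange(A.shape[1]), rated)
    u_pos = V[rated].sum(axis=0)               # positive-preference profile vector
    u_neg = V[unrated].sum(axis=0)             # neutral-preference profile vector

    u_sqp = (V[rated] ** 2).sum(axis=0)        # user square profile
    w_scale = u_sqp                            # equals div(mul(u_sqp, d), d) element-wise
    w_scale_c = 1.0 - u_sqp                    # complementary coefficients

    r1 = ((V * w_scale) @ (w_scale * u_pos)) ** 2      # similarity to the positive part
    r2 = ((V * w_scale_c) @ (w_scale_c * u_neg)) ** 2  # similarity to the neutral part
    p1 = float(np.sum(u_sqp * d))              # scalar weight (our reading of mul(u_sqp, d))
    return p1 * r1 + (1.0 - p1) * r2
```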



4 Experiments

We used the well-known MovieLens ML100k data set, which is accompanied by widely referenced experimental results, e.g., [6], [7]. To analyze the characteristics of the data set we used concentration curves [9] and applied SVD at different k-cut values. As shown in Fig. 1, in the case of extremely sparse data sets, dimensionality reduction has a negative impact on the number of ratings appearing in the recovered data sets. In such a case, each dimension corresponds to one of a set of disjoint subsets, which reduces the number of item/user subsets that may appear in recommendation lists.

[Figure 1: two panels plotting the cumulative % of ratings against the cumulative % of users for k = 10, 20, 100, 200, 943.]

Fig. 1. Rating concentration curves for ML100k, x = 0.004 (on the left), x = 0.008 (on the right).

[Figure 2: F1@10 versus the number of iterations (0–20) for CSM, SVD-RPav, MP, SVD-RP and SVD-CF.]

Fig. 2. Recommendation accuracy for x = 0.004.

5 Conclusions

The results of the experiments show that – as far as the online user profile updating scenario is concerned – the proposed method performed better than several widely used methods. In the analyzed online sessions (in both cases, x = 0.004 and x = 0.008), the CSM method achieved a gain of up to 10 percent in recommendation accuracy over the second-best method; this result is shown in Fig. 2 and Fig. 3. The method based on item popularity (MP) provided comparatively good recommendations when there was a higher amount of behavioral data in the training set: for x = 0.004 MP performed similarly to SVD-RPav, while for x = 0.008 the difference between the quality of MP and the quality of SVD-RPav was much more visible. The SVD-CF method was the worst one in both analyzed cases. An important contribution of this paper is the demonstration of the strong negative impact that dimensionality reduction has on recommendation quality when it is applied to extremely sparse data sets, as shown in Fig. 1.

[Figure 3: F1@10 versus the number of iterations (0–20) for CSM, MP, SVD-RPav, SVD-RP and SVD-CF.]

Fig. 3. Recommendation accuracy for x = 0.008.

6 Acknowledgments

This work is supported by the Polish Ministry of Science and Higher Education, grant N N516 196737.

References

1. Berry, M., Dumais, S. and O'Brien, G.: Using linear algebra for intelligent information retrieval, SIAM Rev. 37, 573-595, (1995)

2. Gedikli, F. and Jannach, D.: Recommending based on rating frequencies, 4th ACM Conference on Recommender Systems, RecSys '10, Spain, (2010)

3. Herlocker, J.L., Konstan, J.A., Terveen, L.G. and Riedl, J.T.: Evaluating Collaborative Filtering Recommender Systems, ACM Trans. Inf. Syst., 22, 1, 5-53, (2004)

4. Rijsbergen, C. J. van: The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, (2004)

5. Sandler, M. and Muthukrishnan, S.: Monitoring algorithms for negative feedback systems, WWW '10, Raleigh, North Carolina, USA, (2010)

6. Sarwar, B. M., Karypis, G., Konstan, J. A. and Riedl, J.: Application of dimensionality reduction in recommender system – a case study, WebKDD, (2000)

7. Shani, G. and Gunawardana, A.: Evaluating Recommender Systems, November, Microsoft Research, Redmond, USA, (2009)

8. Varshavsky, R., Gottlieb, A., Linial, M. and Horn, D.: Novel unsupervised feature filtering of biological data, Bioinformatics, (2006)

9. Zhang, M. and Hurley, N.: Niche Product Retrieval in Top-N Recommendation, WI-IAT '10, Washington, DC, USA, (2010)

10. Zhang, M. and Hurley, N.: Novel Item Recommendation by User Profile Partitioning, WI-IAT '09, Washington, DC, USA, (2009)


An Exploration of Query Term Deletion

Hao Wu and Hui Fang

University of Delaware, Newark DE 19716, [email protected], [email protected]

Abstract. Many search users fail to formulate queries that can well represent their information needs. As a result, they often need to reformulate the queries in order to find satisfying results. Query term deletion is one of the commonly used strategies for query reformulation. In this paper, we study the problem in the context of the TREC session track. We first show that the reformulated queries used in the session track lead to less satisfying search results than the original queries, which suggests that the current simulation strategy used in the session track for term deletion might not be a good method. Moreover, we also examine the potential of query term deletion, propose two simple strategies that automatically select terms for deletion, and discuss why the simulation strategy for term deletion fails to reformulate better queries.

1 Introduction

It is often difficult for search users to describe their information needs using keyword queries. Given an information need of a user, the first query formulated by the user may not be effective in retrieving many relevant documents, and the user often needs to reformulate queries in order to find more satisfying search results. Existing studies on query log analysis show that around half of the queries are reformulated by users [2, 1].

Query deletion is one of the commonly used query reformulation strategies. Intuitively, query deletion can benefit over-specified queries so that the returned search results would have greater coverage of relevant information. Moreover, query deletion can also benefit ill-specified queries so that non-relevant documents that are attracted by noisy query terms would not be returned. A common belief is that search users are capable of selecting terms for deletion and the reformulated queries would lead to more satisfying search results. Thus, existing work on query deletion mainly focused on how to predict which terms would be deleted by search users during query reformulation [1].

In this paper, we focus on the problem of query term deletion in the context of the TREC query session track. Specifically, we first analyze the query sessions in the TREC 2010 query session track, and compare the retrieval performance of the original queries with that of the reformulated queries obtained through query term deletion. We find that query reformulation based on term deletion hurts the performance. The failure analysis suggests that the simulation used by the session track tends to delete more terms than necessary.


We also conduct a set of experiments to study the potential of query term deletion. Specifically, we aim to answer the following question: if the decision on which terms to delete could be made correctly, would it be possible to improve the retrieval performance with the reformulated queries? Experimental results suggest that query term deletion, if done appropriately, can significantly improve the retrieval performance.

Motivated by the potential of query term deletion for improving retrieval performance, we explore two simple strategies that aim to automatically delete query terms. The first strategy aims to delete less important terms, while the second one aims to delete query terms that over-specify the information needs. Empirical evaluation shows that both strategies are effective in improving retrieval performance.

2 Analysis of Term Deletion in Query Sessions

In this section, we conduct experiments to study whether search users are capable of selecting appropriate query terms for deletion. We use the collection provided in the TREC 2010 session track [2]. The data set we used is the ClueWeb09 Category B collection. There are 150 topics and each of them has two queries: one is the original query and the other is a reformulated query. Out of these queries, 26 topics are reformulated through query term deletion and 22 have relevance judgments. Our experiments are based on these 22 topics. The retrieval function is the Dirichlet prior method [3] with µ = 500. The retrieval performance is evaluated based on nDCG@20.
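For reference, a minimal sketch of Dirichlet-prior query-likelihood scoring with µ = 500, the smoothing method of [3]; the variable names and the handling of unseen terms are ours:

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=500):
    """log P(query | document) under Dirichlet prior smoothing."""
    doc_tf, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = collection_tf.get(t, 0) / collection_len   # collection language model
        if p_coll == 0:
            continue                  # simplification: skip terms unseen in the collection
        score += math.log((doc_tf[t] + mu * p_coll) / (doc_len + mu))
    return score
```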

Table 1. An example of ideal query term deletion

Candidate query after term deletion   Retrieval Performance (nDCG@20)
disneyland                            0.026
special                               0.000
offers                                0.000
disneyland special                    0.017
disneyland offers                     0.166
special offers                        0.000

Table 2. Comparison among different versions of queries

Queries        Performance (nDCG@20)   Average Query Length
Original       0.160                   4.455
Reformulated   0.138                   2.455
Ideal          0.252                   3.091

We conduct two sets of experiments. First, we compare the retrieval performance for the original and reformulated queries in order to see whether simulated search users are good at term deletion. Second, we examine the potential of query term deletion. The goal is to study whether query term deletion, if done appropriately, can lead to better retrieval performance. Specifically, for every query, we generate all possible candidate queries after term deletion, evaluate the retrieval performance of each of the queries, and choose the one with the best performance.


For example, consider the query “disneyland special offers”, whose nDCG@20 value is 0.103. Table 1 lists all the possible candidate queries after term deletion and their corresponding performance. The candidate query with the best performance is referred to as the ideal query in this paper.

Table 2 shows the performance comparison between the original queries, the reformulated queries and the ideal queries. We can see that simulated search users on average delete two query terms, which is more than for the ideal queries. Moreover, we make two interesting observations.

– The retrieval performance for the reformulated queries is worse than that for the original queries. This observation suggests that simulated users may not be capable of deleting the query terms that hinder retrieval performance. Thus, the simulation strategy used in the session track might not be a good one.

– If we had an oracle that always makes the correct decisions on which terms to delete, the retrieval performance could be improved significantly.

Clearly, it would be interesting to study which strategies for query term deletion are effective.

3 Methods for Automatic Term Deletion

In this section, we study methods for automatic query term deletion.

When formulating a query, users may include terms that are less effective in identifying relevant information. Thus, one possible solution is to delete query terms based on their importance, which is referred to as Imp. One common measure of term importance is the IDF value of a term, so we delete query terms with lower IDF values.

Over-specifying an information need is a common reason for query term deletion. Thus, we need to delete terms that cause over-specification without changing the information need. One possible solution is to delete terms that are more semantically related to the other terms, which is referred to as Sim. The rationale is that deleting this kind of term hurts the performance less because the related terms are still able to bring in a lot of relevant information. Following previous studies [4], we compute the term semantic similarities based on the expected mutual information of a term pair over the document collection. Thus, in the second method, we delete query terms with higher average mutual information values.

We have discussed two strategies for query term deletion. One is to keep deleting the query terms with the lowest IDF, and the other is to keep deleting the terms with the highest average MI values. The remaining challenge is to decide when to stop term deletion. We explore two strategies, as sketched below. In the first strategy, we use a threshold as a cut-off: we delete terms whose IDF values are lower than a threshold, or terms whose MI values are higher than a threshold. In the second strategy, we use a percentage of the number of original query terms as a cut-off: we stop term deletion when the percentage of query terms that have been deleted is larger than the cut-off.
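A minimal sketch combining the two deletion strategies with the two stop criteria; the function and parameter names are ours, and the IDF and average-MI values are assumed to be precomputed per query term:

```python
def delete_terms(query_terms, idf, avg_mi, strategy="imp", stop="percentage",
                 threshold=1.0, percentage=0.2):
    """Return the query after automatic term deletion.

    idf      : dict term -> IDF value in the collection
    avg_mi   : dict term -> average mutual information with the other query terms
    strategy : "imp" deletes low-IDF terms, "sim" deletes high-MI terms
    stop     : "threshold" or "percentage"
    """
    if strategy == "imp":
        order = sorted(query_terms, key=lambda t: idf.get(t, 0.0))      # least important first
        candidate = lambda t: idf.get(t, 0.0) < threshold
    else:
        order = sorted(query_terms, key=lambda t: -avg_mi.get(t, 0.0))  # most related first
        candidate = lambda t: avg_mi.get(t, 0.0) > threshold

    max_deletions = int(percentage * len(query_terms))
    deleted = set()
    for t in order:
        if stop == "threshold" and not candidate(t):
            break
        if stop == "percentage" and len(deleted) >= max_deletions:
            break
        deleted.add(t)
    return [t for t in query_terms if t not in deleted]
```

With stop="percentage" and percentage=0.2, this corresponds to the 20% setting discussed below.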

Since we have two term deletion strategies and two stop criteria, there are four methods to be compared. Table 3 shows the results of the performance comparison.


Table 3. Performance Comparison

Imp + threshold
  Cut-offs   IDF < 0 (no deletion)   IDF < 0.5   IDF < 1.0   IDF < 1.5   IDF < 2.0
  nDCG@20    0.160                   0.168       0.167       0.154       0.148

Imp + percentage
  Cut-offs   0% (no deletion)   20% terms   30% terms   40% terms   50% terms
  nDCG@20    0.160              0.177       0.176       0.145       0.134

Sim + threshold
  Cut-offs   MI > ∞ (no deletion)   MI > 2.5   MI > 2.0   MI > 1.5   MI > 1.0
  nDCG@20    0.160                  0.160      0.163      0.179      0.098

Sim + percentage
  Cut-offs   0% (no deletion)   20% terms   30% terms   40% terms   50% terms
  nDCG@20    0.160              0.163       0.156       0.150       0.106

We see that both term deletion strategies can improve retrieval performance if we choose proper parameters. When deleting terms based on term importance (i.e., IDF), using the percentage of query terms as a cut-off is more effective. The reason is that longer queries tend to include more terms of low importance. Thus, if we delete terms based on their importance, the number of deleted terms should be proportional to the original query length, i.e., we need to delete more terms for longer queries and fewer terms for shorter queries. When deleting terms based on semantic similarities (i.e., MI), using the threshold yields better performance, because not all queries are over-specified and it is unfair to delete the same percentage of terms for every query. In summary, deleting 20% of the original query terms seems to always improve the retrieval performance. This is consistent with the results in Table 2. In particular, we suggest deleting the 20% of the original query terms with the lowest IDF values, because it gives good performance and the parameter value is not collection-dependent.

4 Why did simulated users fail?

In this section, we discuss why the simulation used in the session track fails to choose appropriate terms for deletion.

Table 4. Comparison between reformulated queries and the ideal queries

               terms deleted by the simulation   terms deleted by the “oracle”
Total Number   44                                30
Average IDF    2.484                             2.373
Average MI     1.150                             1.335

We first look into what kind of terms are deleted by the simulation during reformulation and what kind of terms should be deleted. Table 4 shows the term statistics in these two situations. We see that simulated users also tend to delete terms with lower IDF and high MI values. But simulated users often delete more terms than necessary.

We then conduct a per-query analysis to compare the queries formulated by the simulation with the ideal queries, and classify simulated-user mistakes into three categories: (1) over-deletion: deleting more terms than necessary; (2) under-deletion: deleting fewer terms than necessary; (3) error-deletion: deleting the wrong terms. Table 5 summarizes these three error types. It is interesting to see that the number of over-deletion errors is high. Since it is much easier to make an error-deletion than an over-deletion or under-deletion, this shows that the simulated users have the ability to differentiate the usefulness of query terms, but they delete terms too aggressively.

Table 5. Analysis of Errors

Error type       Num. of queries   Example (original query / ideal query / user's reformulated query)
Over-deletion    9                 diabetes education videos books / diabetes education videos books / diabetes education
Under-deletion   1                 windows defender reports software problems / defender problems / windows defender problems
Error-deletion   8                 wall hung toilet kohler / hung toilet kohler / wall hung toilet

To summarize our findings so far, the simulation is good at deciding which terms to delete but it tends to delete more terms than necessary. This is probably the main reason why the reformulated queries perform worse than the original ones. To further verify our hypothesis, we conduct another set of experiments. We force the system to delete the same number of terms as the simulation deleted for every query. The results are shown in Table 6. It is clear that the humans are smarter at selecting which terms to delete than the system based on the two proposed term deletion methods. Thus, we expect that if simulated search users are more conservative in deleting query terms, the reformulated queries should lead to more satisfying search results.

Table 6. Performance comparison when deleting the same number of terms as the simulation did for every query

           simulation's deletion   deleting terms with lowest IDF   deleting terms with highest MI
nDCG@20    0.138                   0.107                            0.120

5 Conclusions and Future Work

We study the problem of query term deletion in the context of the TREC query session track. We propose two simple strategies for automatic term deletion, and experimental results show that both methods are effective if the parameter values are set appropriately. We also find that the poor query reformulation of simulated sessions is caused by the fact that the simulation tends to delete more terms than necessary, while it is good at picking the terms that need to be deleted.

In the future, we plan to conduct similar studies on real query sessions and over multiple collections. We also plan to explore other strategies for automatic query term deletion.

References

1. R. Jones and D. C. Fain. Query word deletion prediction. In Proceedings of SIGIR'03, 2003.
2. E. Kanoulas, B. Carterette, P. Clough, and M. Sanderson. Session track overview. In Proceedings of TREC'10, 2010.
3. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR'01, 2001.
4. W. Zheng and H. Fang. Query aspect based term weighting regularization in information retrieval. In Proceedings of ECIR'10, 2010.


Query Session Data vs. Clickthrough Data as Query Suggestion Resources

Makoto P. Kato1, Tetsuya Sakai2, and Katsumi Tanaka1

1 Kyoto University, Kyoto, Japan, [email protected], [email protected]

2 Microsoft Research Asia, Beijing, China, [email protected]

Abstract. Query suggestion has become one of the most fundamental features of Web search engines. Some query suggestion algorithms utilize query session data, while others utilize clickthrough data. The objective of this study is to examine which of these two resources can provide more effective query suggestions. Our results show that query session data outperforms clickthrough data in terms of clickthrough rate.

Keywords: query suggestion, query log, query session, clickthrough, Web search.

1 Introduction

Query suggestion, which enables the user to revise a query with a single click, has become one of the most fundamental features of Web search engines. Providing effective query suggestions to the user is very important for helping him express his information need precisely so that he can access the required information. To this end, some query suggestion algorithms utilize query session data, while others utilize clickthrough data. However, to date, it is not clear which of these two resources can provide more effective query suggestions. Using a large-scale query suggestion usage log from a commercial search engine and some simple query suggestion selection methods, we show that query session data outperforms clickthrough data in terms of clickthrough rate.

2 Related Work

Query suggestion algorithms that rely on query session data include the following. Boldi et al. [3] and Anagnostopoulos et al. [1] selected query suggestions from queries that are likely to follow a given query. He et al. [6] proposed a query suggestion algorithm based on the resemblance between the user's query sequence and the query sequence history.

In contrast, query suggestion algorithms that rely on clickthrough data include the following. One approach is clustering queries based on their clicked URLs, and selecting query suggestions from the query cluster to which the original query belongs [2, 5]. Another approach is incorporating a random walk in a query and clicked-URL bipartite graph [7–9].


3 Query Suggestion Selection Methods

To select query suggestions for a given query, we introduce four types of query ranking methods. Given a query q ∈ Q, each query q′ ∈ Q is ranked, and the top n queries are chosen as query suggestions.

3.1 Session-based Ranking Methods

A basic approach to generating query suggestions from query session data is to find queries that often follow or are followed by a given query within the same session, as seen in previous work [1, 3, 6]. Thus, we used the within-session reformulation probability and its symmetric variant to rank queries for a given query by using query session data.

For queries q and q′, the within-session reformulation probability $P_{session}(q'|q)$ is defined as follows:

\[ P_{session}(q'|q) = \frac{f(q, q')}{f(q)}, \quad (1) \]

where f(q) denotes the number of occurrences of a query q in the query session data, and f(q, q′) denotes the number of co-occurrences of q and q′ within the same session, where q′ occurs after q. The probability $P_{session}(q'|q)$ can be interpreted as how likely a query q′ follows q within the same session.

Similarly, the symmetric within-session reformulation probability $P^{sym}_{session}(q, q')$ is defined as follows:

\[ P^{sym}_{session}(q, q') = \frac{f(q, q') + f(q', q)}{f(q) + f(q')}, \quad (2) \]

which represents how likely a query q′ follows or is followed by q within the same session.
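The counts behind Equations 1 and 2 could be accumulated along the following lines (a sketch; whether f(q, q′) counts only adjacent queries or any later query in the session is our reading of the definition):

```python
from collections import Counter

def session_based_scores(sessions):
    """sessions: list of sessions, each a chronologically ordered list of query strings."""
    f = Counter()        # f(q): occurrences of each query
    f_pair = Counter()   # f(q, q'): q' occurs after q within the same session
    for session in sessions:
        f.update(session)
        for i, q in enumerate(session):
            for q_next in session[i + 1:]:
                f_pair[(q, q_next)] += 1

    p_session = {(q, r): c / f[q] for (q, r), c in f_pair.items()}        # Eq. 1
    p_sym = {(q, r): (c + f_pair[(r, q)]) / (f[q] + f[r])                 # Eq. 2
             for (q, r), c in f_pair.items()}
    return p_session, p_sym
```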

3.2 Click-based Ranking Methods

A straightforward approach to generating query suggestions from clickthrough data is based on the similarity of URLs clicked in response to a query. Some studies used the clicked-URL similarity [2, 5], while others incorporate a random walk in a query and URL bipartite graph [7–9]. Thus, we used the click-based similarity and the click-based transition probability for selecting query suggestions from clickthrough data.

For queries q and q′, the click-based similarity $Sim_{click}(q, q')$ is defined as follows:

\[ Sim_{click}(q, q') = \frac{\sum_{u \in U} w(q, u)\, w(q', u)}{\sqrt{\sum_{u \in U} w(q, u)^2}\,\sqrt{\sum_{u \in U} w(q', u)^2}}, \quad (3) \]

where w(q, u) represents how many times a URL u has been clicked in response to a query q, and U is the whole set of URLs.

The click-based transition probability from q to q′, $P_{click}(q'|q)$, is defined as follows:

\[ P_{click}(q'|q) = \sum_{u \in U} \frac{w(q, u)}{\sum_{u' \in U} w(q, u')} \cdot \frac{w(q', u)}{\sum_{r \in Q} w(r, u)}, \quad (4) \]

which represents the probability of a transition from q to q′ via a URL in a bipartite graph, where queries and URLs are nodes, and each edge is weighted by w(q, u).

Table 1. Data statistics

(a) Query session data:
  # of queries         22,212,088
  # of unique queries  11,216,354
  # of sessions        774,096

(b) Clickthrough data:
  # of clicks          20,686,083
  # of unique queries  7,161,456
  # of unique URLs     11,317,011

(c) Query suggestion log:
  # of unique queries      110,805,265
  # of unique suggestions  876,537,724
  Total impression count   3,572,451,604
  Total clicked count      18,935,221
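The two click-based scores of Equations 3 and 4 could be computed as in the following sketch, where clicks[q][u] plays the role of w(q, u) (the data layout and the function name are ours):

```python
import math
from collections import defaultdict

def click_based_scores(clicks, q, q2):
    """clicks: dict query -> dict URL -> click count, i.e. w(q, u)."""
    w_q, w_q2 = clicks.get(q, {}), clicks.get(q2, {})

    # Eq. 3: cosine similarity of the clicked-URL vectors of q and q'
    dot = sum(c * w_q2.get(u, 0) for u, c in w_q.items())
    norm = math.sqrt(sum(c * c for c in w_q.values())) * \
           math.sqrt(sum(c * c for c in w_q2.values()))
    sim_click = dot / norm if norm else 0.0

    # Eq. 4: transition probability from q to q' via a URL in the bipartite graph
    total_q = sum(w_q.values())
    url_totals = defaultdict(int)                 # sum over all queries r of w(r, u)
    for wq in clicks.values():
        for u, c in wq.items():
            url_totals[u] += c
    p_click = sum((c / total_q) * (w_q2.get(u, 0) / url_totals[u])
                  for u, c in w_q.items()) if total_q else 0.0
    return sim_click, p_click
```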

4 Experiments

4.1 Data

Query session and clickthrough data were collected from October 1st to October 10th, 2009 through Microsoft Internet Explorer. Query session data are triplets comprising a session id, a query and a timestamp, while clickthrough data are query-URL pairs. The statistics of the two data sets are shown in Table 1(a) and (b), respectively.

To evaluate the quality of query suggestions generated from the two data resources, we utilized a query suggestion log between May 2-8, 2010 from Microsoft's Bing search engine (http://www.bing.com/). The query suggestion log contains records whose fields are query, query suggestion, impression count (how many times the suggestion was shown), and clicked count (how many times the suggestion was clicked). The statistics are shown in Table 1(c).

4.2 Evaluation Method

Query suggestion clickthrough rate (QSCTR) is a metric used to evaluate the quality of query suggestions. We define the QSCTR of a <query, suggestion> pair as its clicked count divided by its impression count. QSCTR can be interpreted as the probability that a user clicks on a query suggestion given in response to a query. The average QSCTR over our query suggestion log is 0.00632.

A natural evaluation method for the four query suggestion ranking methods would be to let the session-based methods rank all queries from the query session data and let the click-based methods rank all queries from the clickthrough data (Table 1(a) and (b)). However, as we need the click and impression counts for each query suggestion for evaluation, we only rank the <query, suggestion> pairs contained in the query suggestion log (Table 1(c)).

4.3 Results and Discussions

Average QSCTRs for the top k <query, suggestion> pairs are shown in Figure 1. We can see that the session-based methods $P_{session}$ and $P^{sym}_{session}$ outperform the click-based methods $Sim_{click}$ and $P_{click}$.


[Figure 1: QSCTR versus the value of k (0–500,000) for $P_{session}$, $P^{sym}_{session}$, $Sim_{click}$ and $P_{click}$.]

Fig. 1. Average QSCTRs for top k query-suggestion pairs.

Moreover, <query, suggestion> pairs ranked high by the session-based methods achieved high QSCTR, while there is little correlation between the ranks of the click-based methods and QSCTR. The symmetric session-based method appears to be the overall winner. This result suggests that considering not only forward reformulations but also backward reformulations is effective for improving QSCTR. Note also that this strategy can suggest more queries than the forward-only method.

To drill down into the overall result, we utilized the query reformulation classification schema proposed by Boldi et al. [4]. Let X, Y, and Z denote nonempty sets of query terms. We define specialization as a transition from a query represented by X to one represented by X ∪ Y; generalization as a transition from X ∪ Y to X (or Y); parallel movement as a transition from X ∪ Y to X ∪ Z; and error correction as a reformulation where the Levenshtein distance between a query and the reformulated query is less than θ (2 in this study). All other reformulations are classified as new.
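A sketch of this classification for a pair of consecutive queries; the order in which the categories are tested and the whitespace tokenization are our assumptions:

```python
def classify_reformulation(q, q_next, theta=2):
    """Classify a query transition following the schema of Boldi et al. [4]."""
    a, b = set(q.split()), set(q_next.split())
    if levenshtein(q, q_next) < theta:
        return "error correction"
    if a < b:                        # terms added: X -> X ∪ Y
        return "specialization"
    if b < a:                        # terms removed: X ∪ Y -> X (or Y)
        return "generalization"
    if a & b and a != b:             # shared terms, both sides changed: X ∪ Y -> X ∪ Z
        return "parallel movement"
    return "new"

def levenshtein(s, t):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```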

Figure 2 shows the QSCTR results per query reformulation type (generalization is omitted due to its small sample size). Figure 2(d) shows that the session-based methods are not effective for the new type: they underperform the click-based methods for large k. In addition, Figure 2(b) shows that the QSCTRs of the session-based methods are relatively low for the parallel movement type. For the other two query reformulation types, the general picture is similar to the overall results shown in Figure 1.

5 Conclusion

Using a large-scale query suggestion usage log, we compared some query suggestion selection methods based on query session and clickthrough data. Our experimental results strongly suggest that query session data is superior to clickthrough data as a query suggestion resource.

Acknowledgments

This work was supported in part by the following projects and institutions: a Kyoto University GCOE Program entitled “Informatics Education and Research for Knowledge-Circulating Society,” Grants-in-Aid for Scientific Research (No. 18049041) and JSPS Fellows (No. 10J04687) from MEXT of Japan.


[Figure 2: QSCTR versus the value of k for $P_{session}$, $P^{sym}_{session}$, $Sim_{click}$ and $P_{click}$, one panel per reformulation type: (a) Specialization, (b) Parallel movement, (c) Error correction, (d) New.]

Fig. 2. Average QSCTRs for top k query-suggestion pairs per query reformulation type.

References

1. A. Anagnostopoulos, L. Becchetti, C. Castillo, and A. Gionis. An optimization framework for query recommendation. In Proc. of WSDM 2010, pages 161–170, 2010.
2. R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. In Current Trends in Database Technology – EDBT 2004 Workshops, pages 588–596, 2004.
3. P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna. The query-flow graph: model and applications. In Proc. of CIKM 2008, pages 609–618, 2008.
4. P. Boldi, F. Bonchi, C. Castillo, and S. Vigna. From Dango to Japanese Cakes: Query Reformulation Models and Patterns. In Proc. of WI 2009, pages 183–190, 2009.
5. H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In Proc. of KDD 2008, pages 875–883, 2008.
6. Q. He, D. Jiang, Z. Liao, S. Hoi, K. Chang, E. Lim, and H. Li. Web query recommendation via sequential query prediction. In Proc. of ICDE 2009, pages 1443–1454, 2009.
7. H. Ma, H. Yang, I. King, and M. Lyu. Learning latent semantic relations from clickthrough data for query suggestion. In Proc. of CIKM 2008, pages 709–718, 2008.
8. Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In Proc. of CIKM 2008, pages 469–478, 2008.
9. Y. Song and L. He. Optimal rare query suggestion with implicit user feedback. In Proc. of WWW 2010, pages 901–910, 2010.


Query Session Detection as a Cascade (Extended Abstract)

Matthias Hagen, Benno Stein, and Tino Rüb

Bauhaus-Universität Weimar, <first name>.<last name>@uni-weimar.de

Abstract We propose a cascading method for query session detection in search engine logs (i.e., for finding consecutive queries a user submitted for the same information need). Our approach involves different detection steps that form a cascade in the sense that computationally costly features are applied only after cheap features “failed.” This cascade is different from previous session detection approaches, most of which involve many features simultaneously. Our experiments show the cascading method to save runtime compared to the state of the art while the detected sessions' accuracy is improved.

1 Introduction

We tackle the problem of query session detection from search engine logs. Detecting such search sessions is of major interest as they offer the possibility to support users stuck in longer sessions or to learn from the query reformulation patterns. Since the queries of one user in a log can be easily sorted by submission time, session detection is often modeled as the task of determining for each pair of chronologically consecutive queries whether the two queries were submitted for the same user information need.

We propose a new approach in the form of a cascading method. Unlike former approaches to the problem, the cascading method does not require the simultaneous evaluation of all features. Instead, it processes the features in different steps one after the other by increasing computational cost. Whenever a “cheap” feature allows for a reliable decision, features with higher cost are not computed. If, however, the cheaper features cannot reliably decide, more features are computed. For example, if a query contains the preceding query (e.g., istanbul and istanbul archaeology), it is reasonable to assume that both queries belong to the same session. No other feature besides a simple term overlap is needed for that decision. For more complex situations like istanbul archeology and constantinople, simple term overlap fails, but computationally more costly features that are able to identify semantic similarities can support the desired decision of also assigning these two queries to the same session.

A use case of session detection is the extraction of sessions from a stored log in order to obtain sessions on which improved retrieval techniques can be evaluated. For this purpose, many studies in the literature use a basic time threshold since it is both easy to implement and very fast compared to the more sophisticated methods developed in the session detection literature. Our cascading method aims at closing the gap between developed session detection methods and their application in practice: the cascading method is only about 5 times slower than a simple time threshold check but comes with a far more reliable session accuracy.



2 Related Work

In a recent survey, Gayo-Avello compares most of the existing session detection approaches against a gold standard of human-annotated queries [3]. At the same place he introduces the geometric method and shows its superiority over the existing methods in terms of detection accuracy. The geometric method involves only basic features and thus is very efficient. Other approaches also achieve convincing accuracy results in Gayo-Avello's study, but come at the cost of evaluating many features for every pair of consecutive queries, leading to a bad runtime.

Session detection is often used as a pre-processing step to extract sessions from a query log on which, in turn, particular retrieval techniques are tested. In this regard, many authors decide not to use methods developed in the session detection community but resort to some time threshold (like 10 or 15 minutes) between consecutive queries. Obviously, this is an easy-to-implement and fast way to detect sessions from a query log. But the resulting sessions' quality is not convincing [3].

Because of its efficiency, the geometric method could close the gap between research on session detection methods and their application in practice. However, as a stand-alone approach, the geometric method (by design) is not able to detect semantic similarities of queries. Our cascading method builds upon the geometric method and introduces additional steps that are able to detect semantic similarities. Our objective is threefold: a runtime performance comparable to the geometric method, a similar ease of implementation, and a further improved session accuracy.

3 The Cascade: Step by Step

This section describes the steps of our cascading session detection method. From step to step the required features get more expensive, but later steps are invoked only if the previous steps were not able to come to a reliable decision.

Throughout our explanations we view a query q as a set of keywords. For each query the search engine log additionally contains a user ID and a time stamp. If the user clicked on a result, the log also contains the clicked result's rank and URL. We assume that the queries of one user in the log are ordered by submission time and explain our cascading framework for the queries submitted by one user. We model the problem of session detection as the problem of deciding for each consecutive pair q, q′ of queries whether the session that contains q continues with q′ or whether q′ starts a new session.

Step 1: Simple Query String Comparison

The simplest patterns that two consecutive queries q and q′ may form can be detected by a simple comparison of the keywords: repetition (q = q′), generalization (q′ ⊆ q), and specialization (q ⊆ q′). Whenever two consecutive queries represent one of these three cases, our approach assigns them to the same session regardless of the time that has passed between their submission. The rationale is that in case of a longer time between a repetition, generalization, or specialization pattern we assume the user to have continued a pending session. While Step 1 is able to reliably detect session continuations for repetitions, generalizations or specializations, it would decide “new session” in all other cases, which is not always correct. For these other cases our



cascading method invokes the geometric method [3] in Step 2 as a more sophisticated comparison of the query strings.

Step 2: Geometric Method

The geometric method relaxes Step 1's query overlap condition with respect to the elapsed time between the queries. This way, session continuations can be detected for query pairs that are syntactically more different than simple repetition, generalization, or specialization patterns.

Let t and t′ be the submission times of a pair q and q′ of consecutive queries. Using the offset t′ − t, the geometric method computes the time feature $f_{time} = \max\{0,\, 1 - \frac{t' - t}{24h}\}$. Thus, chronologically very close queries achieve scores near to 1, whereas longer time periods between query submissions decrease the score until it becomes 0 for queries with a gap of 24h or larger. The syntactic similarity for q′ is computed as the cosine similarity $f_{cos}$ between the character 3- to 5-grams of the query q′ and the sessions whose current last query is q. The geometric method votes for a session continuation iff $(f_{time})^2 + (f_{cos})^2 \geq 1$. This decision rule can be geometrically interpreted as plotting the point $(f_{time}, f_{cos})$ in $\mathbb{R}^2$ and checking whether it lies inside or outside the unit circle.

Although the geometric method detects queries with overlapping terms more reliably than does Step 1, there are problematic cases where the geometric method should not be trusted. Such cases include query pairs where $f_{time}$ is large (i.e., chronologically close queries) but the syntactic similarity reflected by $f_{cos}$ is rather low. An example is the pair istanbul archeology and constantinople from the introduction. Assuming that the session started with istanbul archeology, the only overlapping 3- to 5-grams are sta, stan, and tan, but nevertheless one would expect both queries to belong to the same session as semantically they are very similar. In pilot experiments we determined that for query pairs with $f_{cos} < 0.4$ and $f_{time} > 0.8$ (queries that are chronologically close but that have a small n-gram overlap) the geometric method's decision misses many semantically similar queries and wrongly assigns them to different sessions. Hence, for query pairs that fall in this range, our cascading method drops the geometric method's decision and invokes Step 3 to further analyze the semantic similarity of the current query pair.
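A compact sketch of Steps 1 and 2 for one pair of consecutive queries; the helper names, the session representation as a character n-gram multiset, and the "unsure" return value are our assumptions:

```python
import math
from collections import Counter

def cascade_step1_step2(q_prev, q, t_prev, t, session_ngrams):
    """q_prev, q: keyword sets; t_prev, t: submission times in seconds.
    Returns "same", "new", or "unsure" (hand over to Step 3)."""
    # Step 1: repetition, generalization, or specialization
    if q == q_prev or q <= q_prev or q_prev <= q:
        return "same"
    # Step 2: geometric method
    f_time = max(0.0, 1.0 - (t - t_prev) / (24 * 3600))
    f_cos = cosine(char_ngrams(" ".join(sorted(q))), session_ngrams)
    if f_cos < 0.4 and f_time > 0.8:
        return "unsure"                       # geometric decision not trusted here
    return "same" if f_time ** 2 + f_cos ** 2 >= 1.0 else "new"

def char_ngrams(text, sizes=(3, 4, 5)):
    return Counter(text[i:i + n] for n in sizes for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(c * b.get(g, 0) for g, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0
```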

Step 3: Explicit Semantic Analysis

An elegant way to compare the semantic similarity of two texts is the explicit semantic analysis (ESA) introduced by Gabrilovich and Markovitch [2]. The idea is not to compare the two given texts directly but to use an index collection against which similarities are calculated. Since the index collection (e.g., the Wikipedia articles) can be preprocessed and stored, invoking ESA is not too expensive compared to the basic session detection features such as n-gram overlap or query submission time.

The ESA principle works as follows. During a preprocessing step, a tf·idf-weighted term-document matrix of the Wikipedia articles is stored as the ESA matrix. At runtime, the two to-be-compared texts, represented as vectors, are multiplied with the ESA matrix, and the cosine similarity of the resulting vectors yields the ESA similarity. In our setting, the two texts that should be ESA-compared are the keywords of q′ and all the keywords of the queries in the session to which the previous query q belongs. As Anderka and Stein [1] showed that the ESA accuracy varies only very little with the size of the index collection, we conducted a pilot experiment with different numbers of Wikipedia articles and finally used a sample of 100 000 articles as the ESA index collection.

One problem of ESA applied to queries is that the texts that are compared are rather short. Hence, to have a reliable decision, we only use ESA to detect session continuations and choose an ESA similarity threshold of 0.35 that has to be achieved as an argument for a session continuation. In the case that the ESA similarity is below the threshold, we do not immediately view q′ as the start of a new session but view ESA's decision as “not sure” and invoke Step 4 of the cascade, which aims at enlarging the representation of the queries that are compared.
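Step 3 could then look roughly as follows, assuming a preprocessed tf·idf term-document matrix over the Wikipedia sample and a vocabulary index (both hypothetical inputs):

```python
import numpy as np

def esa_similarity(text_a, text_b, esa_matrix, vocabulary):
    """esa_matrix: tf-idf weighted term-document matrix (|vocabulary| x #articles);
    vocabulary: dict term -> row index."""
    def concept_vector(text):
        v = np.zeros(esa_matrix.shape[0])
        for term in text.lower().split():
            if term in vocabulary:
                v[vocabulary[term]] += 1.0
        return v @ esa_matrix                  # project the text onto the article space

    a, b = concept_vector(text_a), concept_vector(text_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def step3_decision(query_keywords, session_keywords, esa_matrix, vocabulary):
    """Vote "same" only if the ESA similarity reaches the 0.35 threshold, else "unsure"."""
    sim = esa_similarity(" ".join(query_keywords), " ".join(session_keywords),
                         esa_matrix, vocabulary)
    return "same" if sim >= 0.35 else "unsure"
```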

Step 4: Search Result Comparison

Step 4 uses the web search results of the queries q and q′. Since retrieving these results requires index accesses at the search engine site or the submission of two (time-consuming) web queries from an external client, the web results are applied only if all previous steps failed to provide a reliable decision.

Using web search results to detect semantically similar queries is not a new idea (cf. [4] for example) but is applied in different variants. Some authors use the URLs of the retrieved documents, others fetch the complete documents. Moreover, different numbers of search results are used in the literature, ranging from the top-10 documents up to the top-50. We evaluated different settings in a pilot study and finally chose to compare the sets of URLs of the top-10 retrieved documents via the Jaccard coefficient (ratio of common top-10 URLs of q and q′). Whenever the Jaccard coefficient is at least 0.1 (i.e., q′ returns at least one of the top-10 results of q), we view this as an argument for a session continuation. Otherwise, q′ is treated as the start of a new session.
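Step 4 then reduces to a set comparison of the top-10 result URLs; a trivial sketch (fetching the result lists is left out):

```python
def step4_same_session(top_urls_q, top_urls_q2, min_jaccard=0.1):
    """Compare the top-10 result URL sets of q and q' via the Jaccard coefficient."""
    a, b = set(top_urls_q[:10]), set(top_urls_q2[:10])
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= min_jaccard
```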

4 Experimental Evaluation

To ensure comparability, we evaluated our cascading method on the annotated gold standard query corpus that Gayo-Avello used in his experiments [3]. The corpus contains 11 484 queries of 223 users sampled from the AOL query log. The queries are manually subdivided into 4 254 sessions with an average of 2.70 queries per session.

For evaluation purposes we use the F-Measure $F_\beta = \frac{(1+\beta^2) \cdot prec \cdot rec}{\beta^2 \cdot prec + rec}$, where precision and recall for the detected sessions are measured against the human gold standard. We follow Gayo-Avello and set β = 1.5, which emphasizes wrong session continuations as the bigger problem compared to wrong session breaks. Gayo-Avello reports an F-Measure of 0.9184 for his geometric method and we could verify his results in our experiments. The cascading method improves upon this value and achieves an F-Measure of 0.9323.

Our experiments reveal that Step 2 requires about 2.25 times more time for a query pair analysis than Step 1. Hence, on the about 40% of the queries that Step 1 detects as repetitions, generalizations, or specializations, our cascade saves time compared to the geometric method (also note that for these queries, Step 1 always correctly votes for a session continuation). After Step 2, about 75% of the queries are reliably judged (F-Measure of 0.9184), such that Step 3, which requires about 1.08 times more time than Step 2 with a preprocessed ESA matrix in main memory, is invoked on only 25% of the queries. Hence, after Step 3, our cascading approach is still faster than the original geometric method. The only crucial issue for runtime is Step 4, which we implemented against the Bing API and which requires more than 20 times the runtime of Step 1. Step 4 is invoked on about 22% of the queries but increases the F-Measure only slightly: after Step 3 we already achieve 0.9315. I.e., when efficiency is an issue, Step 4 can be omitted, still yielding both an improved session accuracy compared to the original geometric method and an improved runtime. However, also note that at the search engine site Step 4 could be operationalized at much higher efficiency.

Since a potential use case of our method is session extraction from a stored log file (e.g., in a pre-processing step to obtain sessions on which some improved retrieval techniques should be evaluated), we also suggest a very fast second version of our method (without Step 4) that assigns a “not sure, maybe new session” decision when Step 3 votes for a new session. A post-processing step can remove all sessions involved in a “not sure” decision (this removes about 22% of all the queries). The remaining sessions then achieve an F-Measure of 0.9755. I.e., this version of the cascading method with post-processing can be used as a very fast and reliable session extraction method that produces high-quality sessions resembling the ones a human would have extracted.

5 Conclusion and Outlook

We have presented a cascading session detection approach, based on the geometric method and the well-known ESA retrieval model, which is able to achieve very high query session detection accuracy against a human gold standard. Our method sensibly invokes time-consuming features only when cheaper features failed to provide a reliable session detection. Equipped with a post-processing step that drops sessions with “not sure” decisions, the accuracy of our approach is almost perfect on Gayo-Avello's gold standard. Hence, the system could be applied as a pre-processing step for many evaluations that need to extract high-quality sessions from query logs as experimental data.

An interesting aspect for future research is to invoke a post-processing step that is able to account for multitasking at the user site (cf. [4]). The goal then is to merge sessions into a hierarchy that resembles different levels of search goals and missions.

Bibliography

[1] M. Anderka and B. Stein. The ESA retrieval model revisited. In Proceedings of SIGIR 2009, pages 670–671.
[2] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of IJCAI 2007, pages 1606–1611.
[3] D. Gayo-Avello. A survey on session detection methods in query logs and a proposal for future evaluation. Information Sciences, 179(12):1822–1843, 2009.
[4] R. Jones and K. L. Klinkner. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of CIKM 2008, pages 699–708.


Automatic Generation of Query Sessions using Text Segmentation

Debasis Ganguly, Johannes Leveling, and Gareth J.F. Jones

CNGL, School of Computing, Dublin City University, Dublin-9, Ireland, {dganguly, jleveling, gjones}@computing.dcu.ie

Abstract. We propose a generative model for automatic query reformulations from an initial query using the underlying subtopic structure of top-ranked retrieved documents. We address two types of query reformulations: a) specification, where the reformulated query expresses a more particular information need compared to the previous query; and b) generalization, where the query is reformulated to retrieve more general information. To test our model we generate the two reformulation variants starting with topic titles from the TREC-8 ad hoc track as the initial queries. We use the average clarity score as a specificity measure to show that the specific and the generic query variants have a higher and lower average clarity score, respectively. We also use manual judgements from multiple assessors to calculate the accuracy of the specificity and generality of the variants, and show that there exists a correlation between the relative change in the clarity scores and the manual judgements of specificity.

1 Introduction

Traditional Information Retrieval (IR) models assume that the information seeking process is static and involves successively refining a query to retrieve documents relevant to the original information need. However, observational studies of information seeking find that searchers' information needs change as they interact with a search system. Searchers learn about the topic as they scan retrieval results and term suggestions, and formulate revised information needs as previously posed questions are fully or partially answered [1]. The topic collections of standard ad hoc evaluation tracks fail to model this user behaviour.

The Session Track, organized for the first time at TREC 2010 [2], is an effort to evaluate retrieval systems over an entire session of user queries rather than on separate independent topics. The topic creation phase involved starting with Web-track diversity topics sampled from the query logs of a commercial search engine. Specific variants of the initial topic were created by manually extracting keywords from the different subtopics. The general variants were formed in two ways: a) by constructing an over-specified query from one of the subtopics and removing words manually, and b) by adding manually selected related words from a different subtopic. Related work on automatic query reformulation includes that of Dang and Croft [3], which uses anchor text to reformulate a query by


substituting some of the original terms, assuming that the information need in the reformulated query is identical to that of the initial one. Our work is different in the sense that we seek to move the query towards a more specific subtopic or a broader topic, which is associated with a change of information need.

This paper tries to answer the research question of how to build test collections for studying the “Session IR” task by modeling user interactions over a session. We aim to develop topic variants on a large scale for ad hoc retrieval collections which do not possess such meta-information as query logs or anchor texts. The novelty of the paper is that we use the underlying semantic structure of top-ranked retrieved documents, obtained by applying text segmentation, to design a generative model for query reformulation.

2 Automatic generation of topic variants

Motivation A document retrieved in response to a query may comprise multiple subtopics which are related to more specific aspects of the information need expressed in the query. For example, the document FBIS3-20090, fetched in response to the TREC query title Foreign minorities Germany, contains a segment on Synagogue attack. If the user is interested in a more specific reformulation, he is likely to choose terms which occur frequently in one or a few subtopics. Whereas if he is interested in a more general formulation, it is more likely that he would choose terms which are not concentrated in one of the subtopics but occur abundantly throughout the entire document. Applying a text segmentation based approach in our method for simulated query generation is an attempt to model this behaviour.

Jansen et al. [4] view generalization as the removal of query terms and specialization as the addition of terms. This is particularly true when the implicit connectives of the query terms are strictly conjunctive in nature, i.e., the relation between query terms (subordination of concepts) plays an important role in determining the type of reformulation. For example, the query “osteoporosis” (TREC topic 403) could be considered as a more specific reformulation of “osteoporosis bone disorder”, but only if the underlying information need involves an implicit conjunction of all the terms in the latter, i.e., the searcher is not interested in other bone diseases. An alternative interpretation is that the user is interested in bone disorders in general, with a reference to osteoporosis in particular. Thus, the addition of terms can also contribute to the generalization of a query if the added terms have semantic relations such as has-a or is-a with respect to the original terms, i.e., there is an implied focus on one particular aspect of a query. While a possible approach to generalizing a query could involve starting with a longer initial query, such as the description part of the TREC topics, followed by a removal or substitution of specific terms with more general ones, in this paper we concentrate only on the additive model of query reformulation.

Specific reformulation Our generative model tries to utilize the fact that a term indicative of a more specific aspect of an initial information need is typically densely distributed in a small part of the text [5].


Algorithm 1 Reformulation(Q, R, ns, ng)

Input: Q – the original query; R – the number of top-ranked documents to use; ns – the maximum number of specific terms to add from each document; ng – the maximum number of general terms to add from each document.

  SpExpQry ← ∅; GnExpQry ← ∅
  for i = 1 to R do
      d ← the i-th document
      Segment d into segments {s1, s2, . . . , sn} with the C99 algorithm
      smax ← the segment with the maximum number of matching query terms
      Score each term t in smax by φ(t, smax) and add the top ns terms to SpExpQry if the term is not already in Q
      Score each term t of d by ψ(t) and add the top ng terms to GnExpQry if the term is not already in Q
  end for
  return (SpExpQry, GnExpQry)

We segment a document by applying the state-of-the-art segmentation algorithm C99 [6]. To characterize specific reformulation terms we assign scores to terms considering the following two factors: a) how frequently a term t occurs in a segment s, denoted by tf(t, s), and how exclusive the occurrence of t in s is as compared to other segments of the same document, denoted by |S|/sf(t), where |S| is the number of segments in that document and sf(t) is the number of segments in which t occurs; b) how rare the term is in the entire collection, measured by the document frequency (df), the assumption being that rare terms are more likely to be specific terms. We use a linear combination to calculate term scores, as shown in Equation 1.

\[ \phi(t, s) = a \cdot tf(t, s)\frac{|S|}{sf(t)} + (1 - a) \cdot \log\frac{|D|}{df(t)} \quad (1) \]

\[ \psi(t) = a \cdot tf(t, d)\frac{sf(t)}{|S|} + (1 - a) \cdot \log\frac{|D|}{df(t)} \quad (2) \]

Equation 1 assigns higher values to terms which occur frequently in a segment, occur only in a few segments, and occur infrequently in the collection.

General reformulation In contrast to a more specific term, a more general term is distributed uniformly throughout the entire document text [5]. So an obvious choice is to score a term based on the combination of the term frequency in the whole document (instead of the frequency in individual segments) and the segment frequency (instead of the inverse segment frequency), where tf(t, d) is the number of occurrences of t in d (see Equation 2). Algorithm 1 is used to create the two types of reformulations of an initial query Q. Another possible approach to generalization could involve the removal or substitution of terms with higher φ(t, s) in the initial query by those having lower ones, thus making general reformulation an inverse to specialization.
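Equations 1 and 2 translate directly into a small sketch, with the segments and the full document given as token lists and df as a precomputed document-frequency table (the names are ours; the default values of a follow the settings reported in Section 3):

```python
import math

def phi(t, segment, segments, df, num_docs, a=0.8):
    """Specificity score (Eq. 1): frequent in one segment, rare elsewhere and in the collection."""
    sf_t = sum(1 for s in segments if t in s)
    return a * segment.count(t) * len(segments) / sf_t + (1 - a) * math.log(num_docs / df[t])

def psi(t, document, segments, df, num_docs, a=0.1):
    """Generality score (Eq. 2): frequent and spread uniformly across the whole document."""
    sf_t = sum(1 for s in segments if t in s)
    return a * document.count(t) * sf_t / len(segments) + (1 - a) * math.log(num_docs / df[t])
```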


Evaluation The clarity score [7] of a query is the KL divergence between the distribution estimated for generating the query from the top-ranked pseudo-relevant documents and the probability of generating the query from the collection model. Since the specific version of an initial query aims at a narrower information need, we hypothesize that the clarity score of the specific version should increase. Conversely, for a more general information need we add terms which are expected to occur in a larger number of documents, potentially making the query more ambiguous and hence resulting in a decrease of the clarity score.
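As a reference, in the standard formulation of [7] the clarity score can be written as follows, where P(w|Q) is a query language model estimated from the set R of top-ranked documents and P_coll(w) is the collection language model:

\[
\mathrm{Clarity}(Q) = \sum_{w \in V} P(w \mid Q)\,\log_2 \frac{P(w \mid Q)}{P_{\mathrm{coll}}(w)},
\qquad
P(w \mid Q) = \sum_{d \in R} P(w \mid d)\, P(d \mid Q).
\]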

3 Experiments

[Figure 1 plots the average clarity (y-axis) against the interpolation parameter a (x-axis, 0 to 1) for the Specific, General, and Initial queries.]

Fig. 1: Average clarity versus a.

We start with the titles of the TREC-8 topics as initial queries and use Algorithm 1 to form the variants. The parameters were set to (R, ns, ng) = (5, 3, 2) after a set of initial experiments, with the aim of increasing the average clarity of the specific variants and decreasing that of the general ones. Figure 1 shows that for specific queries we obtain the maximum clarity at a = 0.8, and for general queries we obtain the minimum clarity at a = 0.1. We generated 100 simulated queries (two variants for each TREC-8 topic) with the above settings of the parameter a. Five assessors manually judged the quality of the reformulated queries with yes/no answers. In order to seek a possible correlation between the relative changes in clarity scores and the average score of the manual judgements, for every query variant QVi obtained from Qi (i = 1 . . . n), we compute the following:

\[
\delta_i = \frac{\mathrm{clarity}(QV_i) - \mathrm{clarity}(Q_i)}{\mathrm{clarity}(Q_i)},
\qquad
m_i = \frac{1}{N_a} \sum_{j=1}^{N_a} d_j
\qquad (3)
\]

In Equation 3, δi is the relative change in the clarity score and mi is the average decision score (dj = 1 for yes, dj = −1 for no), Na being the number of assessors for the ith topic. We scale the mi values by the magnitude of the clarity change to avoid ties in the mi scores across topics, and then compute the Spearman correlation between the δi and the mi|δi| values. The accuracy of the generated queries for both variants is measured by taking a majority decision (1 if the majority agrees and 0 otherwise) for each topic and averaging it over all topics. The results are shown in Table 1. Fleiss' κ for the assessments of both variants shows substantial inter-assessor agreement. For the specific variant, a good example output is "poaching, wildlife preserves (bear african tiger)", where the parenthesized words indicate the newly added words, whereas "killer bee attacks (agricultural experts)" is indicative of an imprecise specialization. For the generalization variant, "carbon monoxide poisoning (hyperbaric chamber)" is an instance of good generalization, but "cosmic events (religion)" is an instance of inaccurate generalization.

Table 1: Accuracy of the generated query variants

Variant    Accuracy   Spearman Coeff.   Fleiss' κ   Deduction
Specific   0.83       0.30              0.68        High accuracy, medium correlation
Generic    0.63       -0.17             0.62        Fair accuracy, small correlation
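A minimal sketch of the correlation and accuracy computations summarized in Table 1, assuming clarity_orig and clarity_var hold per-topic clarity scores and judgements holds, per topic, the list of assessor decisions coded as +1/−1; scipy is assumed to be available and all names are illustrative:

from scipy.stats import spearmanr

def evaluate_variants(clarity_orig, clarity_var, judgements):
    # Relative clarity change (Eq. 3), scaled mean judgement, Spearman correlation
    # between the two, and majority-vote accuracy over all topics.
    deltas, scaled_m, votes = [], [], []
    for c_q, c_qv, decisions in zip(clarity_orig, clarity_var, judgements):
        delta = (c_qv - c_q) / c_q
        m = sum(decisions) / len(decisions)
        deltas.append(delta)
        scaled_m.append(m * abs(delta))                 # scaling breaks ties across topics
        votes.append(1 if sum(decisions) > 0 else 0)    # 1 if the majority agrees
    rho, _ = spearmanr(deltas, scaled_m)
    return rho, sum(votes) / len(votes)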

4 Conclusions

We have shown that the proposed model of simulated query generation can be used to produce query reformulations with accuracies of 83% and 63% for the specific and general cases respectively. We find positive and negative correlations between the changes in clarity scores and the manual judgements of specificity and generality respectively, which means that most of the assessors agree on a specific reformulation when the clarity score increases, and most of them agree on a generalization when it decreases. A correlation of manual judgement with an automatic measure such as the clarity score suggests that clarity scores alone, without the need for manual assessments, can be good indicators of the nature of the change in information need, thus suggesting that the automatic development of user sessions on a large scale could be possible with little or no manual post-processing effort.

Acknowledgments

This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project.

References

1. Bates, M.J.: The Design of Browsing and Berrypicking Techniques for the Online Search Interface. Online Review 13(5) (1989) 407–424
2. Kanoulas, E., Clough, P., Carterette, B., Sanderson, M.: Session track at TREC 2010. In: SIMINT workshop at SIGIR '10, New York, NY, USA, ACM (2010)
3. Dang, V., Croft, W.B.: Query reformulation using anchor text. In: Proceedings of WSDM '10, New York, NY, USA, ACM (2010) 41–50
4. Jansen, B.J., Booth, D.L., Spink, A.: Patterns of query reformulation during web searching. J. Am. Soc. Inf. Sci. Technol. 60 (July 2009) 1358–1371
5. Hearst, M.: TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1) (1997) 33–64
6. Choi, F.Y.Y.: Advances in domain independent linear text segmentation. In: Proceedings of the NAACL. (2000) 26–33
7. Cronen-Townsend, S., Croft, W.B.: Quantifying query ambiguity. In: Proceedings of HLT '02, San Francisco, USA (2002) 104–109


Crowdsourcing Interactions

Capturing query sessions through crowdsourcing

G. Zuccon∗, T. Leelanupab†, S. Whiting∗, E. Yilmaz‡, J. M. Jose∗, and L. Azzopardi∗

∗University of Glasgow, UK; ‡Microsoft Research, Cambridge, UK; †King Mongkut's Institute of Technology Ladkrabang, Thailand

Abstract. The TREC evaluation paradigm, developed from the Cranfield experiments, typically considers the effectiveness of information retrieval (IR) systems when retrieving documents for an isolated query. A step forward towards a robust evaluation of interactive information retrieval systems has been achieved by the TREC Session Track, which aims to evaluate retrieval performance of systems over query sessions. Its evaluation protocol consists of artificially generated reformulations of initial queries extracted from other TREC tasks, and relevance judgements made by NIST assessors. This procedure is mainly due to the difficulty of accessing session logs and because interactive experiments are expensive to conduct. In this paper we outline a protocol for acquiring user interactions with IR systems based on crowdsourcing. We show how real query sessions can be captured in an inexpensive manner, without resorting to commercial query logs.

1 Introduction

Traditional information retrieval research has focused on developing models and techniques for maximising the number of relevant documents retrieved for a single query. In practice, however, this is not the case, as searching for information is usually a highly interactive process. This aspect has long been ignored or marginalised within IR research. Recent developments have however shifted the focus towards investigating the interactions between users and system during the search process, as well as studying how systems can support such interactions (see for example [3]). The ultimate goal is to develop systems that take into account all the queries a user issues as well as the interactions occurring during a search session, so as to assist them throughout. Within this context, an emerging research thread focuses on how those systems can be experimentally evaluated. The TREC Session Track initiative [4] aims to evaluate IR systems over query sessions through controlled laboratory experiments, i.e. the "Cranfield evaluation paradigm" ("based on the abstraction of a test collection" [6]: a set of documents, a set of topics and a set of relevance judgements from NIST experts), as opposed to interactive experiments. In doing so, much of the interaction is lost and the evaluation methodology resorts to query reformulations that have been artificially built [4]. For example, query reformulations of the specification type are generated by selecting a query and one of its subtopics from the TREC Web diversity task dataset and extracting appropriate keywords.


Crowdsourcing, implemented in a number of web-based platforms such as Amazon Mechanical Turk (AMT, http://www.mturk.com/) and CrowdFlower (http://crowdflower.com/), has been used as an inexpensive and often efficient way to conduct large-scale focused studies. The crowdsourcing paradigm has already been successfully used in IR for performing a number of tasks. For example, crowdsourcing has been used to gather relevance assessments for the TREC 2010 Blog Track [5]. In this paper, we outline a protocol for capturing user interactions throughout a search session using crowdsourcing. The protocol is based on the proposal made by Zuccon et al. [7].

Intuitively, the protocol for capturing search session information through crowdsourcing operates as follows: workers (i.e. users of the crowdsourcing tool) are asked to complete information seeking tasks within a web-based crowdsourcing platform. During the information seeking tasks, workers are assisted by an IR system encapsulated within the web-based crowdsourcing platform. While workers perform information seeking tasks, researchers can capture logs of the workers' interactions with the IR system. Furthermore, researchers have the possibility to acquire entry and post-search information and statistics, which would help to characterise (to some extent) the user population.

2 A protocol for Crowdsourcing Query Sessions

In [7] we outlined a protocol for conducting interactive IR experiments within crowdsourcing platforms. Information seeking tasks are performed within the following context: crowdsourced workers are provided with a search engine, embedded into the crowdsourcing platform, which assists them in acquiring information to satisfy their information need. Workers can issue an initial query, read document snippets, interact with the documents, and issue new queries or query reformulations. The protocol then develops around four major aspects: (1) characterise the user population, (2) define the information seeking tasks, (3) capture interactions, (4) acquire post-retrieval information. In this paper, we limit the discussion to the aspects influencing the acquisition of search session data, and in particular of query sessions. We thus focus on aspects (2) and (3).

2.1 Define Information Seeking Tasks

Information seeking tasks assigned to crowdsourced workers have to be clear and well defined, as no interaction is possible between workers and requesters. Workers are unlikely to make the cognitive effort required by the simulated situations and information seeking tasks defined in the literature for laboratory-based interactive IR experiments [1], i.e. by simulating search situations. This is because the workers' main goal is to complete tasks as efficiently and rapidly as possible. We suggest that in crowdsourced search environments the topic of the search session has to be explicitly stated to workers, together with a number of specific informational questions they are expected to answer. In preliminary experiments conducted to test the validity of the protocol [7], we employed topics and questions selected from the TREC Question Answering (QA) Track for the years 2005, 2006, and 2007 [2]. Consider for example topic 279, extracted from the topic set of the TREC 2007 QA Track: "Australian wines". With respect to this topic, workers are asked to use the provided search engine to help them answer the following questions: "What winery produces Yellowtail?" (279.4), "Where does Australia rank in exports of wine?" (279.3), and "Name some of Australia's female winemakers" (279.5).

In [7] we argue that posing questions about a specific topic sufficiently initiates the workers' information needs, thus avoiding the need to create simulated tasks. The scenario in which the information seeking task is performed is kept straightforward: workers have to answer a number of questions; to do so, they can search for information using the provided IR system.

Fig. 1: The interactive search interface encapsulated within an AMT HIT.

2.2 Capture Interactions

Once topics and questions are assigned, workers can search for answers using the provided IR system. The IR system is able to record the interactions between the workers and the system (e.g. issued queries, clicked results, time spent reading/searching, etc.). Crowdsourcing platforms such as AMT do not provide native tools for capturing this kind of user interaction. However, several solutions can be devised to direct workers towards a tool that is controlled by the experimenters and thus records workers' interactions. For example, a proxy server could be used to achieve this goal. An alternative solution can be developed as follows: workers are shown the interface of the IR system within a self-contained iFrame positioned in the page of the HIT. Through the iFrame, interactions can be recorded and made available for further analysis. We adopted the latter solution when developing our preliminary experiments. The platform of the experimental system is shown in the next section, together with an example of the interactions we were able to capture during our preliminary experiments.
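As an illustration of the kind of server-side logging that the iFrame-based solution allows, the following is a minimal Python sketch using Flask as a stand-in web framework (not the framework of our actual system); the endpoint name, field names and log format are hypothetical.

import json, time
from flask import Flask, request

app = Flask(__name__)
LOG_FILE = "session_log.jsonl"   # hypothetical location of the interaction log

@app.route("/log", methods=["POST"])
def log_event():
    # Receive one interaction event (query submission, result click, relevance mark)
    # sent by the search interface embedded in the HIT's iFrame and append it to the log.
    event = {
        "timestamp": time.time(),
        "assignment_id": request.form.get("assignmentId"),   # AMT assignment being completed
        "topic": request.form.get("topic"),                   # e.g. QA topic 279
        "action": request.form.get("action"),                 # "query", "click" or "mark_relevant"
        "payload": request.form.get("payload"),               # query string or clicked URL
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(event) + "\n")
    return "", 204

if __name__ == "__main__":
    app.run()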


3 Crowdsourcing search sessions: an example

We embedded a search system within the crowdsourcing platform offered by AMT, using an iFrame within a standard HIT. Our IR system was developed as a web-based front-end to the Microsoft Bing API (http://www.bing.com/developers) for web results. Each time a user began a search task, our system was provided with AMT HIT details such as the work assignment ID and the corresponding question topic. Queries, result clicks, and explicit feedback via an optional "Mark as Relevant" button were logged alongside the HIT information. Following completion of the batch of HITs for each experiment, we then merged the collected search logs with the AMT logs to yield a rich source of individual worker data for analysis (logs and screenshots of the search interface are available at http://www.dcs.gla.ac.uk/~guido/sir2011). The AMT data provided statistics such as the search task duration, question answers, unique worker IDs, and qualification scores that can be used to begin explaining behaviours observed through the related query session logs. A screenshot of the search interface encapsulated within an AMT HIT is given in Fig. 1.
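A sketch of the log-merging step just described, assuming the interaction log from the previous sketch and an AMT batch results file that are both keyed by the assignment ID; file and column names are illustrative and pandas is assumed to be available:

import pandas as pd

# Interaction events logged by the embedded search interface (one row per event)
interactions = pd.read_json("session_log.jsonl", lines=True)

# Batch results downloaded from AMT: worker IDs, task durations, answers, qualification scores
amt = pd.read_csv("amt_batch_results.csv")

# Join on the assignment ID so each logged event is enriched with worker-level data
merged = interactions.merge(amt, left_on="assignment_id", right_on="AssignmentId", how="left")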

[Figure 2 shows a timeline of one worker's interactions for the topic "Australian wines": query submissions, result clicks, and documents marked relevant, with timestamps running from 0s to 683s. The queries issued include "about yellow tail", "australia's rank # wine export", "Where does Australia rank in exports of wine?", "yellow tail", "australia wine", "female wine makers in australia", "australian wine", "yellowtail", and "female winemakers in australia"; clicked and marked documents include Wikipedia pages on Yellow Tail and Australian wine, and pages from wiki.answers.com, thewinedoctor.com, and iwda.com.au.]

Fig. 2: Search session of a crowdsourced worker for the topic "Australian wines".

In Fig. 2 we report a session we observed during our experiments. The acquired interaction refers to the example topic of Section 2.1 ("Australian wines"). During this search session, which lasted about 11 minutes, the worker issued 10 queries aiming to solve their information need, i.e. to answer the three questions we asked about Australian wines. The worker also observed the snippets of the retrieved documents, clicked to access some of the documents, and marked some of these as useful (i.e. relevant) for solving the information task. Within the session, the worker issued all three types of queries recognised by the TREC Session Track: specification ("australia wine" → "female wine makers in australia"), drifting ("about yellow tail" → "australia's rank # wine export"), and generalisation ("yellow tail" → "australia wine"). In Fig. 3a we report the evolution in terms of query length of three of the sessions recorded for topic 279.

The statistics regarding the query sessions collected during our experiments are as follows. We collected in total 119 queries for 24 HITs (with 58 assignments completed by workers). Users were encouraged to engage with the retrieval system by promising the award of a bonus payment if they did so, while their HIT was refused if questions were answered without interacting with the search engine. A qualification test based on aptitude tests (as proposed in [7]) was used to characterise the user population. Excluding workers not issuing any queries (1 session), workers issued on average 2.26 queries per session (the average length of a search session is about 5.45 minutes). The queries are on average 3.78 terms long, with a maximum of 10 terms (including stopwords). Finally, in Fig. 3b we report statistics on the interactions we captured during our experiments.
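The per-session statistics above can be derived from the captured log with a few lines of the same illustrative pandas code, for example:

import pandas as pd

log = pd.read_json("session_log.jsonl", lines=True)      # same hypothetical log as above
queries = log[log["action"] == "query"]

queries_per_session = queries.groupby("assignment_id").size()
print("avg. queries per session:", queries_per_session.mean())
print("avg. query length (terms):", queries["payload"].str.split().str.len().mean())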

[Figure 3a plots query length in terms (y-axis, 0–9) against query position in the session (x-axis, 1–10) for three sessions ('1RTQ6...', '154PW...', '13YS8...') recorded for topic 279.]

(a) Evolution in terms of query length of the three sessions recorded for topic 279 of the TREC 2007 QA Track. Labels indicate the number of terms shared between the new and the previous query.

# of interactions: 266
Sessions with logged interactions: 57
Sessions with 1 click: 27
Sessions with 2 clicks: 12
Sessions with 3 clicks: 7
Sessions with 4 clicks: 5
Sessions with 5 clicks: 6

(b) Statistics regarding the interactions captured during our experiments.

Fig. 3

4 Directions of Development

In this paper we have shown how crowdsourcing can be used to capture search sessions, and in particular query sessions. We have suggested that crowdsourcing allows researchers to obtain search session data, and in particular query sessions, that represent real users' queries more closely than those generated within the TREC 2010 Session Track. We have also outlined a protocol that can be used within the TREC Session Track for crowdsourcing real query sessions to be used in evaluating interactive IR systems. Future work will be directed towards the consolidation and evaluation of the introduced crowdsourcing protocol, in particular by comparing the acquired information against that obtained through laboratory-based experiments and session logs obtained from commercial search engines. Furthermore, we intend to explore the possibility of fully relying on crowdsourcing for the evaluation of interactive IR systems.

References

1. P. Borlund. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 2003.
2. H. T. Dang, et al. Overview of the TREC 2007 Question Answering Track. TREC'07, 2007.
3. N. Fuhr. A probability ranking principle for interactive information retrieval. JIR, 12(3):251–265, June 2008.
4. E. Kanoulas, et al. Session track at TREC 2010. In Proc. of SimInt 2010, 2010.
5. R. McCreadie, et al. Crowdsourcing blog track top news judgments at TREC. In Proc. of CSDM 2011, 2011.
6. E. M. Voorhees. TREC: Improving information access through evaluation. Bulletin of the American Society for Information Science and Technology, 32(1):16–21, 2005.
7. G. Zuccon, et al. Crowdsourcing Interactions: A proposal for capturing user interactions through crowdsourcing. In Proc. of CSDM 2011, pages 35–39, 2011.