SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence


Marina Santini

Exploratory Query-log Analysis Workshop
Organized by Findwise AB - www.findwise.com/

Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
Lund, Sweden

SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.

LAST UPDATED: 26 OCTOBER 2012






Query Logs and Actionable Intelligence: Questions to LinkedIn-ers

• “Can anyone suggest references about mining query logs for BI and CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer Experience Management]

• Applying Findability to Mine Query Logs for BI: Preliminaries “How can I profitably use query logs for making better business decisions and predict future trends?” (14th May 2012)

• Mining Query Logs: Query Disambiguation & Understanding through a KB “some linguistic problems can be sorted out -- for example those related to sublanguage, terminology, multi-word expressions, etc. -- through a dictionary-shaped knowledge base where the different uses of language are stored and continually updated. I will call this knowledge base DaisyKB” (21st May 2012)

http://www.forum.santini.se/2012/05/mining-query-logs-for-bi-and-cem/


http://www.forum.santini.se/2012/05/applying-findability-to-mine-query-logs-for-bi-preliminaries/



http://www.forum.santini.se/2012/05/mining-query-logs-query-disambiguation-understanding-through-a-kb/


My preliminary reflections based on this info…

• "The average length of a search query was 2.4 terms"

• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries."

• "… much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually."

• "… in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries."

Then came the corpus…

• Enterprise query logs: VGR (27 August 2012) – easier to handle and interpret than general-purpose search engines’ query logs!

So… that’s the Outline

1. The query log genre
2. Actionable Intelligence
3. A possible use case
4. Preliminary conclusions

What is a (textual) genre?

• Very simply put:
  – A genre is a class of text

What characterizes a genre?

1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)

Genre Characterization

1. Name formation: a genre must indicate a class, a family (for genre name formation, see Görlach, 2004). Recent webgenres: blogs, tweets, chatlogs, etc.

2. Community: a genre is not something individual. A genre is a textual form that is used and recognized by a community (whereas style can be individual). Ex: blogs → bloggers and blog readers; academic home pages → academics; etc.

3. Task: a genre meets a RECURRENT communication need. Ex: a personal home page tells us something about a person; a technical blog is informative about a specific technology; etc.

4. Conventions: ex: a personal blog is made of posts organized in chronological order where a blogger communicates personal and subjective views on some facts.

5. Expectations: when reading a personal blog, readers expect to read something personal (personal facts or personal opinions) and expect the possibility to leave a comment if they wish to do so.

6. A genre is a cultural artifact: it might evolve over time (see the History of Blog by Rebecca Blood, 2000) or might disappear if society changes (ex: chansons de geste). New genres emerge with new media, new technologies, new information needs.

The query log genre is… a novel and fully-emerged webgenre

1. Name: in line with other digital genres (ex: web log → blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search engine
4. Conventions: short texts written in ”keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based, internet-based society OR a subproduct of search engines

The query log genre: Linguistic and Textual Conventions

• Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax: OCCASIONALLY (usually no articles, no prepositions, no subclauses, etc.)

Query Log Genre: The Benefits

• Expressed in a ”lean” sublanguage, the keywordese:
  – reduced morphology
  – reduced syntax
  – short texts
  – mostly nouns and verbs

• Reduced size: compare a 2-year collection of emails vs a 2-year collection of query logs

• = REDUCED SIZE, REDUCED PRE-PROCESSING; NO DATA CLEANING!

Expectations: a text written by a user for a search engine to find relevant information

• The texts (queries) must express information needs aka users’ intents

• It is good practice to be cautious with the interpretation of users’ intents. However…

• If we mine query logs with a simple quantitative approach, it is possible to extract recurrent information needs and build upon them…

Actionable Intelligence

• It must be accurate and verifiable
• It must be timely
• It must be comprehensive
• It must be comprehensible
• It must give the ability to act on that information straightaway

I would argue: a Query Log is an ”Actionable” Corpus

• Let’s see…

Mining query logs for actionable intelligence: Description and Basic Statistics

• Corpus Time frame: 2010-2011 (2 years)

• “These logs come from the search at hittavard.vgregion.se. The biggest bulk should come from 1177.se. The rest should be from vgregion.se. The target audience are both VGR (Västra Götalands Region) users/employees as well as the general public, as it is a public site. The internal files are searches made from within the VGR…”

• Corpus size:
  – size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
  – number of queries = 249,243
  – number of words = 306,453

• Average query length: 1.23 words
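The figures above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the log has already been reduced to one query string per entry; the function name `basic_stats` and the toy sample are illustrative and are not part of the VGR data.

```python
def basic_stats(queries):
    """Return (number of queries, number of words, average words per query)."""
    n_queries = len(queries)
    # Words are approximated by whitespace-separated tokens.
    n_words = sum(len(q.split()) for q in queries)
    avg = n_words / n_queries if n_queries else 0.0
    return n_queries, n_words, avg


# Toy sample for illustration only (not the VGR log itself).
sample = ["egenremiss", "mina vårdkontakter", "förnya recept", "feber"]
print(basic_stats(sample))

# The average reported on the slide follows from its own totals.
print(round(306_453 / 249_243, 2))
```

Whitespace splitting is the simplest possible notion of a word; it treats a compound such as "egenremiss" as one word, which is consistent with the low average reported here.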

http://hittavard.vgregion.se/

http://1177.se/

http://vgregion.se/

Case study enterprise search – VGR

FINDWISE SLIDESHARE: http://www.slideshare.net/findwise/case-study-enterprise-search-vgr

http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/



Business Decision: Improve Search Quality and Usability to increase Users’ Satisfaction & Competitiveness

How?

• The simplest approach…

ANALYZE THE HEAD

(1) Take the Top-Ranked Queries

(2) Use them as TAGS (metadata creation)

1. egenremiss
2. mina vårdkontakter
3. webbisar
4. sjukresor
5. vårdgaranti
6. sjukresa
7. mammografi
8. vårdval
9. influensa
10. urinvägsinfektion
11. halsfluss
12. förnya recept
13. magkatarr
14. vattkoppor
15. byta vårdcentral
16. blanketter
17. svinkoppor
18. reseersättning
19. klamydia
20. feber
21. högkostnadsskydd
22. vinterkräksjukan
23. patientombudsman
24. öroninflammation
25. logga in
26. frikort
27. hosta
28. magsjuka
29. njursten
30. als

Tags are keywords describing the content
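Steps (1) and (2) amount to counting identical queries and keeping the most frequent ones. Here is a minimal sketch using Python's standard `collections.Counter`; the function name `top_queries`, the lowercasing step and the toy log are assumptions made for illustration.

```python
from collections import Counter


def top_queries(queries, k=30):
    """Return the k most frequent queries as (query, count) pairs."""
    # Trim whitespace and lowercase so trivially different spellings are merged.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(k)


# Toy log for illustration only.
log = ["egenremiss", "Egenremiss ", "sjukresor", "egenremiss", "sjukresor", "feber"]
tags = [query for query, _ in top_queries(log, k=2)]
print(tags)
```

The cut-off `k` is a design choice: a small `k` keeps only the head of the distribution, which is where the tags are most reliable.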

(3) Use TAG metadata to automatically annotate only documents selected by users
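One way to picture this step is sketched below. It assumes that the log also records which document the user selected after each query, in the form of hypothetical `(query, doc_id)` pairs; the slides do not show the VGR log format, so both the pair structure and the document identifiers are invented for the example.

```python
from collections import defaultdict


def annotate_clicked_documents(click_log, tags):
    """Attach a tag to a document only if a user selected that document
    after issuing the tag as a query."""
    tag_set = set(tags)
    doc_tags = defaultdict(set)
    for query, doc_id in click_log:
        q = query.strip().lower()
        if q in tag_set:
            doc_tags[doc_id].add(q)
    return dict(doc_tags)


# Hypothetical click log: (query, identifier of the selected document).
click_log = [
    ("egenremiss", "doc-17"),
    ("sjukresor", "doc-42"),
    ("Egenremiss", "doc-17"),
    ("njursten", "doc-99"),  # not among the chosen tags, so it is ignored
]
print(annotate_clicked_documents(click_log, ["egenremiss", "sjukresor"]))
```

Documents that nobody selected stay untagged, which is the point of annotating only user-selected documents.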

Watch out!

Mismatch or Ambiguity?

(4) Use most frequent queries to create a query suggester
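A query suggester of this kind can be as simple as a prefix match ranked by frequency. The sketch below is one possible reading of the idea, not a description of any Findwise product; the function names are hypothetical, and the counts are taken from the top-query frequency list in this presentation.

```python
def make_suggester(query_counts):
    """Build a suggest(prefix, k) function from a {query: frequency} mapping."""

    def suggest(prefix, k=5):
        p = prefix.strip().lower()
        matches = [(q, c) for q, c in query_counts.items() if q.startswith(p)]
        # Most frequent first; ties are broken alphabetically.
        matches.sort(key=lambda pair: (-pair[1], pair[0]))
        return [q for q, _ in matches[:k]]

    return suggest


# Frequencies as reported for the VGR log.
counts = {
    "sjukresor": 8787,
    "vårdgaranti": 7345,
    "sjukresa": 3938,
    "vårdval": 3723,
    "svinkoppor": 1878,
}
suggest = make_suggester(counts)
print(suggest("sjuk"))
print(suggest("vård"))
```

A linear scan is fine for a head of a few thousand queries; a trie or sorted index would only be needed for much larger suggestion lists.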

(5) If you want, you can sort queries automatically into query types and build…
• a taxonomy

• The categories of the taxonomy can be also used to annotate existing documents automatically (another layer of METADATA)
  – TAGS describe the content
  – CATEGORIES IN A TAXONOMY organize the content
  – Categories can be hierarchical whereas tags cannot

If you want, you can give the taxonomy to document creators, so they can annotate the text with metadata

• … in short you will have a multilabelled corpus that can be used with machine learning.
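As a rough illustration of how tags and categories combine into multiple labels, here is a small sketch. The category names and the assignment of queries to categories are hypothetical and were chosen only for the example; the presentation does not specify a taxonomy.

```python
# Hypothetical taxonomy: category name -> queries that belong to it.
TAXONOMY = {
    "conditions": {"influensa", "halsfluss", "feber", "hosta"},
    "travel": {"sjukresor", "sjukresa", "reseersättning"},
    "administration": {"blanketter", "frikort", "reseersättning"},
}


def categorize(query, taxonomy=TAXONOMY):
    """Return every category whose member set contains the query (multi-label)."""
    q = query.strip().lower()
    return sorted(cat for cat, members in taxonomy.items() if q in members)


for query in ["feber", "reseersättning", "als"]:
    print(query, categorize(query))
```

A query may fall into several categories or into none; queries left without a category are candidates for extending the taxonomy.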

The importance of metadata to structure unstructured data & to extract actionable intelligence

• From Unstructured Data to Actionable Intelligence by Ramana Rao, 2003

• “We access information for various purposes and in various ways according to our purpose. Sometimes we’re surveying an area of knowledge, trying to get a general understanding of what it’s about or what’s available. At other times we’re searching for specific answers. […] It is this range of purpose and context that we can better address by providing a richer set of information access tools based on exploiting metadata.”

http://www.ramanarao.com/papers/rao-itpro-2003-11.pdf







Linguistic Remarks

• At the top of the frequency list:
  – Nouns
  – Compounds
  – A+N
  – V+N

• More complex constructions at the bottom

Syntactic Patterns

In this case, automatic annotation can help a lot

Benefit for the Search Provider

• Mining query logs to extract user-created knowledge, i.e. queries that can be used as tags (metadata)

• Quickly create domain-specific taxonomies you can capitalize upon, especially for new client companies working in related fields

• Enhancements of current search products
• Inexpensive creation of annotated corpora: document annotation through query logs is a simple technique that in a short time will build massive annotated corpora to use for machine learning, which will allow more sophisticated search refinements.

Benefits for Clients & End Users

• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, more friendly, more exhaustive and more accurate.
• If this happens, clients will spend less on customer care. If you find what you need online, there is no need to call a helpdesk or customer care service.

Query Pre-processing?

Absolutely YES

• Normalization
  – egen remiss & egenremiss
  – Spelling correction
• Terminology expansion (domain-dependent)
  – anemi & blodbrist (ex: taken from Freberg Heppin, 2010; ex: painkiller & analgesic)
  – Stemming/Lemmatization (blanketter → blankett; sjukresor → sjukresa)
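The pre-processing steps listed here can be sketched with simple lookup tables. The tables below contain only the examples given on this slide; the function name `normalize` and the choice of which form is treated as canonical are assumptions. A real system would more likely use a spelling corrector and a Swedish lemmatizer than hand-written dictionaries.

```python
from collections import Counter

# Lookup tables built from the slide's own examples.
VARIANTS = {"egen remiss": "egenremiss"}  # spacing variants of a compound
LEMMAS = {"blanketter": "blankett", "sjukresor": "sjukresa"}  # plural -> lemma


def normalize(query):
    """Lowercase, collapse whitespace, merge known variants, lemmatize words."""
    q = " ".join(query.lower().split())
    q = VARIANTS.get(q, q)
    return " ".join(LEMMAS.get(word, word) for word in q.split())


# Frequencies as reported for the VGR log.
raw_counts = {"sjukresor": 8787, "sjukresa": 3938}
merged = Counter()
for query, count in raw_counts.items():
    merged[normalize(query)] += count

print(normalize("Egen  remiss"))
print(merged)
```

Merging the two forms moves "sjukresa" up the frequency list, which is why normalization matters before the head of the list is turned into tags.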

If you want… Nj

• Compound decomposition & Tokenization. Text chunks (such as queries) are more informative and less ambiguous than single words. No need to tokenize or decompose, if RECALL is ok.

• Ontology? Uhm.. not sure we need a semantic structure here….

Tokenization? Domain-dependent?

Top query frequencies

• 21388 egenremiss
• 17360 mina vårdkontakter
• 10553 webbisar
• 8787 sjukresor
• 7345 vårdgaranti
• 3938 sjukresa
• 3734 mammografi
• 3723 vårdval
• 3653 influensa
• 2908 urinvägsinfektion
• 2803 halsfluss
• 2542 förnya recept
• 2460 magkatarr
• 2394 vattkoppor
• 2274 byta vårdcentral
• 2256 blanketter
• 1878 svinkoppor
• 1840 reseersättning
• 1653 klamydia
• 1559 feber
• 1525 högkostnadsskydd
• 1420 vinterkräksjukan
• 1405 patientombudsman
• 1326 öroninflammation
• 1252 logga in
• 1251 frikort
• 1199 hosta
• 1193 magsjuka
• 1184 njursten
• 1167 als

Top word frequencies

• 21565 egenremiss
• 17717 vårdkontakter
• 17407 mina
• 10567 webbisar
• 8880 sjukresor
• 7357 vårdgaranti
• 4044 sjukresa
• 3763 vårdcentral
• 3754 mammografi
• 3732 influensa
• 3730 vårdval
• 2932 urinvägsinfektion
• 2819 halsfluss
• 2805 recept
• 2543 förnya
• 2463 magkatarr
• 2413 vattkoppor
• 2349 i
• 2296 byta
• 2269 blanketter
• 1881 svinkoppor
• 1840 reseersättning
• 1802 feber
• 1666 klamydia
• 1571 högkostnadsskydd
• 1422 vinterkräksjukan
• 1405 patientombudsman
• 1383 hepatit
• 1338 öroninflammation
• 1331 frikort

Different search users’ behaviour: Enterprise vs. Web?

VGR: Swedish – Enterprise Search

Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search


http://www.mhfa.se/


Preliminary reflections revisited…

• "The average length of a search query was 2.4 terms" … uhm… It depends: enterprise vs. web

• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries." … not investigated

• "much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually." … definitely yes

• "in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries." … uhm… It depends: enterprise vs. web + language
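The long-tail remark can be checked against the numbers already given in this presentation. The sketch below takes the 30 counts from the "Top query frequencies" slide and the total of 249,243 queries, and computes how much of the log the head covers; the function name `head_share` is illustrative.

```python
# Counts of the 30 most frequent queries, as listed on the
# "Top query frequencies" slide.
TOP_30 = [
    21388, 17360, 10553, 8787, 7345, 3938, 3734, 3723, 3653, 2908,
    2803, 2542, 2460, 2394, 2274, 2256, 1878, 1840, 1653, 1559,
    1525, 1420, 1405, 1326, 1252, 1251, 1199, 1193, 1184, 1167,
]
TOTAL_QUERIES = 249_243  # number of queries in the corpus


def head_share(head_counts, total):
    """Fraction of all queries accounted for by the given head counts."""
    return sum(head_counts) / total


for k in (1, 3, 10, 30):
    print(f"top {k:>2}: {head_share(TOP_30[:k], TOTAL_QUERIES):.1%}")
```

On these figures the 30 most frequent queries account for roughly 47% of all queries in the log, which is the skew that makes analyzing the head worthwhile.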

Conclusions from this Exploration

• Query logs are a genre that is comparatively easy to exploit for extracting actionable intelligence.

• Query logs are a good, handy and economical source of information for actionable business decisions, such as:
  – keeping a cutting-edge profile on the market
  – enhancing enterprise search usability (query suggester/autofill)
  – disambiguation
  – annotation and taxonomy creation
  – preventing huge costs for customer helpdesk and similar services through cutting-edge search functionality!

• Future: More and diversified use cases…

QUESTIONS?

THANK YOU FOR YOUR ATTENTION
