SearchInFocus Exploratory Study on Query Logs and Actionable Intelligence Marina Santini Exploratory Query-log Analysis Workshop Organized by Findwise , AB - www.findwise.com / Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST) Lund, Sweden SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund. LAST UPDATED: 26 OCTOBER 2012
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
Query logs are an important source of information for surmising users' intents. Although Karlgren (2010) points out that “There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]”, some linguistic problems could be sorted out by applying more advanced text/content analytics, such as register/sublanguage identification and terminology classification (see Friberg Heppin, 2011). In this presentation, I will argue that query logs can be considered a digital textual genre like emails, blogs, chats, tweets and so forth. All these genres contain unstructured information that, still today, is difficult to leverage satisfactorily. The hypothesis that I would like to put forward in this workshop is that query logs might be easier to exploit than other digital genres when extracting useful information and actionable intelligence.
Transcript
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
Marina Santini
Exploratory Query-log Analysis Workshop
Organized by Findwise, AB - www.findwise.com/
Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
Lund, Sweden
SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.
Query Logs and Actionable Intelligence: Questions to LinkedIn-ers
• “Can anyone suggest references about mining query logs for BI and CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer Experience Management]
• Applying Findability to Mine Query Logs for BI: Preliminaries “How can I profitably use query logs for making better business decisions and predict future trends?” (14th May 2012)
• Mining Query Logs: Query Disambiguation & Understanding through a KB “some linguistic problems can be sorted out -- for example those related to sublanguage, terminology, multi-word expressions, etc. -- through a dictionary-shaped knowledge base where the different uses of language are stored and continually updated. I will call this knowledge base DaisyKB” (21st May 2012)
• “The average length of a search query was 2.4 terms"
• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries."
• “… much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually."
• “… in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries."
Then came the corpus…
• Enterprise query logs: VGR (27 August 2012) – easier to handle and interpret than general-purpose search engines’ query logs!
So… that’s the Outline
1. The query log genre
2. Actionable Intelligence
3. A possible use case
4. Preliminary conclusions
What is a (textual) genre?
• Simply simply simply put:
– A genre is a class of text
What characterizes a genre?
1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)
Genre Characterization
1. Name formation: a genre name must indicate a class, a family (for genre name formation, see Görlach, 2004). Recent webgenres: blogs, tweets, chatlogs, etc.
2. Community: a genre is not something individual. A genre is a textual form that is used and recognized by a community (whereas style can be individual). Ex: blogs → bloggers and blog readers; academic home pages → academics; etc.
3. Task: a genre meets a RECURRENT communication need. Ex: the personal home page genre tells us something about a person; a technical blog is informative about a specific technology; etc.
4. Conventions: ex: a personal blog is made of posts organized in chronological order, where a blogger communicates personal and subjective views on some facts.
5. Expectations: when reading a personal blog, readers expect to read something personal (personal facts or personal opinions) and expect to be able to leave a comment if they wish to do so.
6. A genre is a cultural artifact: it might evolve over time (see the History of Blog by Rebecca Blood, 2000) or might disappear if society changes (ex: chansons de geste). New genres emerge with new media, new technologies, new information needs.
The query log genre is… a novel and fully-emerged webgenre
1. Name: in line with other digital genres (ex: web log → blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search engine
4. Conventions: short texts written in ”keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based, internet-based society OR a by-product of search engines
The query log genre: Linguistic and Textual Conventions
• Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax: OCCASIONALLY (usually no articles, no prepositions, no subclauses, etc.)
Query Log Genre: The Benefits
• Expressed in a ”lean” sublanguage, the keywordese:
– reduced morphology
– reduced syntax
– short texts
– mostly nouns and verbs
• Reduced size: compare a 2-year collection of emails vs a 2-year collection of query logs
• = REDUCED SIZE, REDUCED PRE-PROCESSING; NO DATA CLEANING!
Expectations: a text written by a user for a search engine to find relevant information
• The texts (queries) must express information needs aka users’ intents
• It is good practice to be cautious with the interpretation of users’ intents. However…
• If we mine query logs with a simple quantitative approach, it is possible to extract recurrent information needs and build upon them…
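The "simple quantitative approach" mentioned above can be as plain as counting how often each query recurs. The following is a minimal sketch, not the procedure used in the study: the function name `top_queries`, the light normalization (trimming and lowercasing) and the sample log are assumptions made for illustration.

```python
from collections import Counter


def top_queries(queries, n=10):
    """Return the n most frequent queries as (query, count) pairs.

    Assumption: trimming whitespace and lowercasing is enough to treat
    two log entries as the same query.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(n)


# Hypothetical sample log (not taken from the VGR corpus)
sample_log = ["egenremiss", "Egenremiss ", "anemi", "vaccination", "anemi", "egenremiss"]
print(top_queries(sample_log, n=2))  # [('egenremiss', 3), ('anemi', 2)]
```

The queries at the top of such a list are candidates for recurrent information needs; how to interpret them still calls for the caution noted above.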
Actionable Intelligence
• It must be accurate and verifiable
• It must be timely
• It must be comprehensive
• It must be comprehensible
• It must give the ability to act on that information straightaway
I would argue: a Query Log is an ”Actionable” Corpus
• Let’s see…
Mining query logs for actionable intelligence: Description and Basic Statistics
• Corpus Time frame: 2010-2011 (2 years)
• “These logs come from the search at hittavard.vgregion.se. The biggest bulk should come from 1177.se. The rest should be from vgregion.se. The target audience are both VGR (Västra Götalands Region) users/employees as well as the general public, as it is a public site. The internal files are searches made from within the VGR…”
• Corpus size:
– size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
– number of queries = 249,243
– number of words = 306,453
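Basic statistics of this kind are cheap to compute. The sketch below assumes that the log is available as a list of query strings and that words are separated by whitespace; the helper name `corpus_stats` and the sample queries are illustrative. The last lines only redo the arithmetic on the figures reported above.

```python
def corpus_stats(queries):
    """Count queries and whitespace-separated words in a query log."""
    n_queries = len(queries)
    n_words = sum(len(q.split()) for q in queries)
    avg = n_words / n_queries if n_queries else 0.0
    return {"queries": n_queries, "words": n_words, "avg_words_per_query": avg}


# Hypothetical sample, only to show the shape of the result
print(corpus_stats(["egen remiss", "anemi", "vaccination barn"]))

# Arithmetic on the figures reported for the corpus
REPORTED_QUERIES = 249_243
REPORTED_WORDS = 306_453
print(round(REPORTED_WORDS / REPORTED_QUERIES, 2))  # 1.23
```

On the reported figures this gives roughly 1.23 words per query, well below the 2.4 terms quoted earlier for web search.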
3) Use TAG metadata to automatically annotate only documents selected by users
Watch out!
Mismatch or Ambiguity?
4) Use most frequent queries to create a query suggester
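One way to read this step is a prefix-based suggester that ranks past queries by frequency. This is a sketch under that assumption; the class name `QuerySuggester`, its parameters and the sample log are hypothetical, and a production suggester would need more than a linear scan.

```python
from collections import Counter


class QuerySuggester:
    """Suggest past queries that start with the typed prefix, most frequent first."""

    def __init__(self, queries):
        # Assumption: trimming and lowercasing is sufficient normalization
        self.counts = Counter(q.strip().lower() for q in queries if q.strip())

    def suggest(self, prefix, k=5):
        prefix = prefix.strip().lower()
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        # Sort by descending frequency, then alphabetically for stable ties
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        return [q for q, _ in matches[:k]]


# Hypothetical sample log
log = ["vaccination", "vaccin", "vaccination", "vårdcentral", "vaccin barn"]
suggester = QuerySuggester(log)
print(suggester.suggest("vacc"))  # ['vaccination', 'vaccin', 'vaccin barn']
```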
5) If you want, you can sort queries automatically into query types and build…
• a taxonomy
• The categories of the taxonomy can also be used to annotate existing documents automatically (another layer of METADATA)
– TAGS describe the content
– CATEGORIES IN A TAXONOMY organize the content
– Categories can be hierarchical whereas tags cannot
If you want, you can give the taxonomy to document creators, so they can annotate the text with metadata
• … in short you will have a multilabelled corpus that can be used with machine learning.
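Step 3 above (using queries as TAG metadata for the documents users selected) could be sketched as follows. The sketch assumes the log records which document a user opened after each query, which the slides do not confirm; the function name `tags_from_clicks`, the `min_count` threshold and the document ids are hypothetical.

```python
from collections import Counter, defaultdict


def tags_from_clicks(click_log, min_count=2):
    """Turn (query, selected_document) pairs into per-document tag lists.

    A query becomes a tag only if it led to the same document at least
    min_count times: a crude guard against one-off mismatches.
    """
    per_doc = defaultdict(Counter)
    for query, doc_id in click_log:
        per_doc[doc_id][query.strip().lower()] += 1
    return {
        doc_id: sorted(q for q, c in counts.items() if c >= min_count)
        for doc_id, counts in per_doc.items()
    }


# Hypothetical click log: (query, id of the document the user selected)
clicks = [
    ("anemi", "doc1"), ("Anemi", "doc1"),
    ("blodbrist", "doc1"), ("blodbrist", "doc1"),
    ("egenremiss", "doc2"),
]
print(tags_from_clicks(clicks))  # {'doc1': ['anemi', 'blodbrist'], 'doc2': []}
```

Each document then carries zero or more labels, which is the multilabelled shape most machine learning toolkits expect. The threshold does not resolve genuine ambiguity, so the "Watch out!" caveat still applies.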
The importance of metadata to structure unstructured data & to extract actionable intelligence
• From Unstructured Data to Actionable Intelligence by Ramana Rao, 2003
• “We access information for various purposes and in various ways according to our purpose. Sometimes we’re surveying an area of knowledge, trying to get a general understanding of what it’s about or what’s available. At other times we’re searching for specific answers. […] It is this range of purpose and context that we can better address by providing a richer set of information access tools based on exploiting metadata.”
• At the top of the frequency list:
– Nouns
– Compounds
– A+N
– V+N
• More complex constructions at the bottom
Syntactic Patterns
In this case, automatic annotation can help a lot
Benefit for the Search Provider
• Mining query logs to extract user-created knowledge, i.e. queries that can be used as tags (metadata)
• Quickly create domain-specific taxonomies you can capitalize upon, especially for new client companies working in related fields
• Enhancements of current search products
• Inexpensive creation of annotated corpora: document annotation through query logs is a simple technique that in a short time will build massive annotated corpora to use for machine learning, which will allow more sophisticated search refinements.
Benefits for Clients & End Users
• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, more friendly, more exhaustive and more accurate.
• If this happens, clients will spend less on customer care. If you find what you need online, there is no need to call a helpdesk or customer care service.
Query Pre-processing ?
Absolutely YES
• Normalization
– egen remiss & egenremiss
– Spelling correction
• Terminology expansion (domain-dependent)
– anemi & blodbrist (ex: taken from Friberg Heppin, 2010; ex: painkiller & analgesic)
• Tokenization: text chunks (such as queries) are more informative and less ambiguous than single words. No need to tokenize or decompose if RECALL is ok.
• Ontology? Uhm.. not sure we need a semantic structure here….
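The normalization and terminology expansion steps above can be sketched with two small lookup tables. This is a minimal illustration, assuming such tables are curated by hand for the domain: the names `VARIANTS`, `SYNONYMS`, `normalize` and `expand` are hypothetical, spelling correction is left out, and each query is treated as a single chunk rather than tokenized.

```python
# Hypothetical, hand-curated lookup tables for the domain
VARIANTS = {"egen remiss": "egenremiss"}           # spacing variants -> canonical form
SYNONYMS = {"anemi": {"blodbrist"}, "blodbrist": {"anemi"}}  # domain-specific equivalents


def normalize(query):
    """Lowercase, collapse whitespace, and map known variants to one form."""
    q = " ".join(query.lower().split())
    return VARIANTS.get(q, q)


def expand(query):
    """Return the normalized query together with its known domain synonyms."""
    q = normalize(query)
    return {q} | SYNONYMS.get(q, set())


print(normalize("Egen  Remiss"))   # egenremiss
print(sorted(expand("Anemi")))     # ['anemi', 'blodbrist']
```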
Preliminary reflections revisited…
• “The average length of a search query was 2.4 terms” … uhm.. It depends: enterprise vs. web
• “A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries.” … not investigated
• “much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually.” … definitely yes
• “in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries.” … uhm.. It depends: enterprise vs. web + language
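The long-tail claim can be checked roughly by measuring how much of all term usage is covered by the most frequent terms. The sketch below is one possible check, not the analysis behind the "definitely yes" above; the function name `head_coverage`, the `top_fraction` parameter and the sample data are assumptions.

```python
from collections import Counter


def head_coverage(queries, top_fraction=0.1):
    """Share of all term occurrences covered by the most frequent terms.

    top_fraction is the fraction of distinct terms treated as the "head"
    (at least one term is always included).
    """
    terms = Counter(t for q in queries for t in q.lower().split())
    if not terms:
        return 0.0
    ranked = sorted(terms.values(), reverse=True)
    head_size = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:head_size]) / sum(ranked)


# Hypothetical sample: one term dominates, two occur once
sample = ["anemi"] * 4 + ["remiss", "vaccin"]
print(head_coverage(sample, top_fraction=0.5))
```

A value close to 1 for a small `top_fraction` would be in line with a long-tail distribution; a proper test would fit the rank-frequency curve instead.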
Conclusions from this Exploration
• Query logs are a genre that is easier to exploit for extracting actionable intelligence than other digital genres.
• Query logs are a good, handy and economical source of information for actionable business decisions, such as:
– keeping a cutting-edge profile on the market,
– enhancing enterprise search usability (query suggester/autofill),
– disambiguation,
– annotation and taxonomy creation
– preventing huge cost for customer helpdesk and similar services throught