Automatic Query Expansion in Information Retrieval

BY RYAN HERBECK

What is Automatic Query Expansion (AQE)?

“A process which consists of selecting and adding terms to the user's query with the goal of minimizing query-document mismatch and thereby improving retrieval performance.”

Takes a user’s original query and selects and adds related words to it

Used to increase effectiveness of relevant document retrieval in information retrieval systems

Current Information Retrieval (IR) Systems

Standard interface (one textbox, accepts keywords)

Keywords matched against keyword collection

Results are sorted and returned

Using multiple topic-specific keywords returns quality results

Issues:
  User queries are usually short
  Natural language is ambiguous
  Prone to errors and omissions as a result

Vocabulary Problem

System indexers and users often use different words
  Example: “Saltines” and “crackers”

Synonymy: different words, same meaning
  Examples: “TV” and “television,” “CD” and “compact disc”

Polysemy: same word, different meanings
  Examples: “Java,” “Ruby”

Synonymy + word inflections => decrease in recall
  Recall: ability to retrieve all relevant documents

Polysemy => decrease in precision
  Precision: ability to retrieve only relevant documents
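
To make the two measures concrete, here is a minimal sketch that computes precision and recall for a single query. The function name and the document IDs are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents are truly relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3). A synonym the system misses lowers recall; a polysemous match on the wrong sense lowers precision.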

Proposed Solutions

Interactive query refinement

Relevance feedback

Word sense disambiguation

Search results clustering

AQE

Early AQE

Suggested as early as 1960

Investigated a variety of techniques
  Vector feedback
  Term-term clustering
  Comparative analysis of term distributions

Experimented on small-scale collections

Yielded inconclusive results about effectiveness

Gain in recall was often offset by a loss in precision

Queries Today

Volume of data has increased significantly

Number of terms in a user’s query has remained low
  2009: average query length was 2.30 words; same as in 1999

Most common queries are 1-3 words in length

Vocabulary problem is worse
  Scarcity of query terms reduces synonymy handling
  Diversity and size of data increases effects of polysemy

The need for and scope of AQE have increased

Applications of AQE

Question Answering
  Goal: Provide direct responses as opposed to whole documents
  Expand question with related terms expected to be found in documents with answers

Multimedia Information Retrieval
  IR systems search over metadata (annotations, captions, etc.)
  When no metadata exists, IR systems use content analysis, which can be combined with AQE techniques
  Examples: automatic speech recognition, visual features

Information Filtering
  Monitor a stream of documents and select relevant ones
  Documents arrive continuously (e-news, blogs, e-mail, etc.)

Cross-Language Information Retrieval
  Retrieve documents in a language differing from the query
  Issues:
    Insufficient language coverage
    Untranslatable terms
    Translation ambiguity

Related Techniques

Interactive Query Refinement

Relevance Feedback

Word Sense Disambiguation

Search Results Clustering

Interactive Query Refinement (IQE)

Example: Google Suggest

System suggests several formulations of the query

Decision of query formulation made by user

Does not handle feature selection and query reformulation issues

Potential for producing better results than AQE, but requires user expertise

Relevance Feedback

Returns initial query results

Receives user feedback about the relevance of results

Performs a new query based on user feedback

Makes the new query more similar to the relevant documents retrieved, whereas AQE forms a query more similar to the user’s intentions

Data sources of relevance feedback may be more reliable than those of AQE

Word Sense Disambiguation (WSD)

Identifies word meanings in context

Approaches
  Represent words by their text definitions
  Use of WordNet
    English lexical database which groups words into synonym sets (synsets), gives general definitions and records semantic relations between synsets
  Find all of a word’s contexts and cluster similar ones

Computational and effectiveness limitations

Typical queries may be too short for WSD
  Example: “CD”

Search Results Clustering (SRC)

Organizes and groups search results by topic

Attempts to optimize clustering structure and label quality

Labels could be seen as query refinements, but are intended to help the user browse through results

Example: http://clusty.com

How AQE Works

Data Preprocessing

Feature Generation and Ranking

Feature Selection

Query Reformulation

Data Preprocessing

Reformat data source for more effective subsequent processing

Index the collection of documents and run the query against the collection index
  1. Extract text from documents
  2. Extract words without punctuation and ignoring case
  3. Remove articles and prepositions
  4. Reduce word inflections and derivations
  5. Assign a weighted importance value to each word

Example

HTML: ‘<b>Automatic query expansion</b> expands queries automatically.’

Indexed representation (weight determined by frequency): automat 0.33, queri 0.33, expan 0.16, expand 0.16

Each document is represented as a collection of weighted terms
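
The five preprocessing steps can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a production indexer: the stop-word list is a tiny hypothetical sample, and `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm, so its stems will not match the ones shown in the example above.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a tiny sample of articles and prepositions, not a complete list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "with", "by", "at"}
# Assumption: crude suffix list for illustration; a real system would use a proper stemmer.
SUFFIXES = ("ation", "ions", "ion", "ies", "ing", "ed", "ly", "es", "s")


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    # Step 1: extract text from the document.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def toy_stem(word):
    # Step 4 (toy version): strip the first matching suffix, keeping at least 3 letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    # Step 2: lowercase and keep alphanumeric runs only (drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", extract_text(html).lower())
    # Steps 3 and 4: remove stop words, then reduce inflections.
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    # Step 5: weight each term by its relative frequency in the document.
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


weights = preprocess("<b>The cats</b> chased the cat.")
```

Here `weights` maps each stem to its share of the document's non-stop-word tokens, which is the "collection of weighted terms" representation described above.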

Feature Generation and Ranking

Input: original query, transformed data source

Output: set of candidate expansion features (terms that could be added to the original query)

Original query may be preprocessed to have common words removed and/or important words extracted

Techniques:
  One-to-One Associations
  One-to-Many Associations
  Analysis of Top-Ranked Documents
  Query Language Modeling


One-to-One Associations

Between expansion features and query terms
  One feature is related to one query term
  One or more features are generated and ranked for each term

Approaches
  Stemming algorithm: reduces words to root form
  WordNet: synonym sets (synsets), records semantic relations
    Prevents ambiguity (select one synset for one query term)
  Compute term-to-term similarities in a document collection
  Mine user query logs
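
One way to compute term-to-term similarities in a document collection is a co-occurrence measure. The sketch below assumes the Jaccard coefficient over document-level co-occurrence, which is only one of several possible similarity functions; the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_associations(documents):
    """Map each co-occurring term pair to its Jaccard coefficient over documents."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    doc_freq = defaultdict(int)   # number of documents containing each term
    pair_freq = defaultdict(int)  # number of documents containing both terms
    for terms in doc_sets:
        for term in terms:
            doc_freq[term] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_freq[(a, b)] += 1
    return {
        (a, b): n / (doc_freq[a] + doc_freq[b] - n)
        for (a, b), n in pair_freq.items()
    }


def expand_term(term, associations, top_k=3):
    """Rank candidate features for a single query term (one-to-one association)."""
    scored = []
    for (a, b), score in associations.items():
        if term == a:
            scored.append((b, score))
        elif term == b:
            scored.append((a, score))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


docs = ["tv television screen", "tv television", "cracker saltine", "tv remote"]
features = expand_term("tv", jaccard_associations(docs), top_k=1)
```

With these toy documents, "television" is the strongest feature for "tv" because the two terms co-occur in two of the three documents that mention either one.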


One-to-Many Associations

One feature is related to one or more query terms

Approaches
  Extend one-to-one association techniques to other query terms
    Generate a term if it is related to more than one term
    Filters weakly related features
  Combine multiple relationships between term pairs
    Construct term network for the query
    Network contains word pairs linked by relations (synonyms, stems, etc.)
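
The first approach can be sketched as a filter over one-to-one scores: a feature is kept only if it is related to more than one query term. The `related` mapping, its scores, and the choice to sum scores are assumptions made for illustration.

```python
from collections import defaultdict


def one_to_many(query_terms, related, min_terms=2):
    """Keep features related to at least `min_terms` query terms, ranked by summed score.

    `related` is an assumed mapping: query term -> {feature: one-to-one score}.
    """
    totals = defaultdict(float)  # summed association score per feature
    support = defaultdict(int)   # number of query terms each feature is related to
    for term in query_terms:
        for feature, score in related.get(term, {}).items():
            if feature in query_terms:
                continue  # do not propose terms already in the query
            totals[feature] += score
            support[feature] += 1
    kept = [(f, totals[f]) for f in totals if support[f] >= min_terms]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical one-to-one scores for the query "java compiler".
related = {
    "java": {"coffee": 0.5, "programming": 0.6, "island": 0.4},
    "compiler": {"programming": 0.7, "code": 0.3},
}
features = one_to_many(["java", "compiler"], related)
```

Only "programming" survives here, since it is the one feature related to both query terms. This is how the filter drops weakly related features such as "coffee" or "island" for an ambiguous term like "Java".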


Analysis of Top-Ranked Documents

Retrieve top results for original query

Generate expansion features from related terms in these documents

Features are related to query as a whole, as opposed to individual query terms

Approach: Pseudo-Relevance Feedback
  Score each term in top documents by applying a weighting function to the whole collection of documents
  Sum up all weights of each term and sort the terms based on sums
  Issue: weights reflect importance over collection more than importance over query
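
A minimal sketch of the pseudo-relevance feedback scoring described above, assuming a tf-idf style weighting function (term frequency in each top document times a smoothed inverse document frequency over the collection). The weighting function, function name, and sample documents are assumptions.

```python
import math
from collections import Counter, defaultdict


def prf_candidates(top_docs, collection, query_terms, top_k=5):
    """Rank candidate expansion terms found in the top-ranked documents."""
    n_docs = len(collection)
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(doc.lower().split()))

    scores = defaultdict(float)
    for doc in top_docs:
        for term, tf in Counter(doc.lower().split()).items():
            # Assumed weighting: tf times a smoothed idf computed over the collection.
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1))
            scores[term] += tf * idf  # sum the weights across the top documents

    query = {t.lower() for t in query_terms}
    ranked = sorted(
        ((t, s) for t, s in scores.items() if t not in query),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_k]


collection = [
    "apple iphone release",
    "apple pie recipe",
    "iphone apple review",
    "banana bread recipe",
]
# Assumption: documents 0 and 2 were the top results for the query "apple".
candidates = prf_candidates([collection[0], collection[2]], collection, ["apple"])
```

Because the idf component depends only on the collection, a term can rank highly simply for being rare overall. That is the issue noted above: the weights reflect importance over the collection more than importance for the query.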


Query Language Modeling

Generate probability distribution over query terms
  Best features have the highest probabilities

Approaches:
  Mixture Model
    Builds a model from top-ranked documents collection
    Extracts the most distinct part from the document collection
    Use an expectation-maximization algorithm to get probabilities
  Relevance Model
    Builds a model from top-ranked documents individually
    Documents further down the list have less and less influence on word probabilities
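
The relevance model idea can be sketched as follows, assuming an RM1-style estimate in which each top-ranked document contributes its term probabilities weighted by how likely it is to generate the query. The smoothing scheme (interpolation with the pooled top documents) and the parameter `lam` are assumptions, not details given in the slides.

```python
from collections import Counter, defaultdict


def relevance_model(query_terms, top_docs, lam=0.5):
    """Estimate P(w | R) as proportional to the sum over documents of P(w | D) * P(Q | D)."""
    tokenized = [doc.lower().split() for doc in top_docs]
    background = Counter(t for doc in tokenized for t in doc)
    total = sum(background.values())

    def p_term(term, counts, length):
        # Assumed smoothing: interpolate the document model with the pooled top documents.
        return (1 - lam) * counts[term] / length + lam * background[term] / total

    scores = defaultdict(float)
    for doc in tokenized:
        if not doc:
            continue  # skip empty documents
        counts, length = Counter(doc), len(doc)
        query_likelihood = 1.0
        for q in query_terms:
            query_likelihood *= p_term(q.lower(), counts, length)
        for term in counts:
            scores[term] += p_term(term, counts, length) * query_likelihood

    norm = sum(scores.values())
    return {t: s / norm for t, s in scores.items()} if norm else {}


model = relevance_model(["apple"], ["apple iphone apple", "banana bread"])
```

Documents that match the query poorly get a small query likelihood and so contribute little to the final distribution. Under this sketch, that is why documents further down the ranked list have less influence on the word probabilities.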

Feature Selection

Select top features for query expansion

Features are not evaluated further, simply selected based on rank

Limited number of features selected for rapid processing

Using all features is not necessarily better than using only a few

Typically select 10-30 features

Could select features only within a certain rank range

Query Reformulation

Modify the original query by adding selected features to it and perform search

Approaches:
  Query reweighting: assign a weight to each feature using a weighting formula
  Simply add selected features to the original query without weighting
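
Feature selection and query reweighting can be sketched together. The weighting formula here (original terms at weight 1.0, each feature scaled by its normalized rank score times `feature_weight`) is an assumed example, not a formula from the slides; the function name and scores are hypothetical.

```python
def reformulate(query_terms, ranked_features, top_k=10, feature_weight=0.5):
    """Return a weighted query: original terms plus the top-ranked features.

    `ranked_features` is an assumed list of (feature, score) pairs, best first.
    """
    weights = {term: 1.0 for term in query_terms}  # original terms keep full weight
    selected = ranked_features[:top_k]             # feature selection purely by rank
    max_score = max((score for _, score in selected), default=0.0)
    for feature, score in selected:
        if feature in weights or max_score <= 0:
            continue
        # Assumed reweighting: scale the normalized score by feature_weight.
        weights[feature] = feature_weight * score / max_score
    return weights


expanded = reformulate(
    ["apple"], [("iphone", 2.0), ("mac", 1.0), ("pie", 0.5)], top_k=2
)
```

With `top_k=2` the lowest-ranked feature is dropped, and the expanded query is `apple` at weight 1.0, `iphone` at 0.5, and `mac` at 0.25. The unweighted approach corresponds to giving every selected feature the same weight as the original terms.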

Classification of AQE Techniques

Linguistic Analysis

Corpus-Specific Global Techniques

Query-Specific Local Techniques

Search Log Analysis

Web Data

Linguistic Analysis

Focus on morphological, lexical, syntactic and semantic relationships for expansion

Analysis based on dictionaries, thesauri, or sources such as WordNet

Sensitive to word sense ambiguity

Examples:
  Stemming algorithm: reduce terms to root form
  Ontology browsing: paraphrase user’s query in context
  Syntactic analysis: extract relations between terms to find features that appear in related relations

Corpus-Specific Global Techniques

Corpus: large structured set of texts

Analyze contents of a full database to find features used similarly
  Find correlations between term pairs at document level or within paragraphs or sentences

Data-driven
  May not have a simple interpretation

Query-Specific Local Techniques

Utilize local context provided by the query

Make use of top-ranked documents

Examples:
  Analysis of feature distribution difference
  Model-based AQE
  Top-document preprocessing
    Removes irrelevant features before using term-ranking function

Search Log Analysis

Mines users’ search logs for implicit query associations
  Search logs contain queries and URLs of clicked pages
  Example: user searches “apple,” find a past query “iPhone”

May encode implicit relevance feedback instead of retrieval feedback

Examples:
  Extract features from past related queries that are related to the current query
  Use top documents from past related queries
  Extract terms directly from visited documents
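
One simple way to find past related queries is to look for queries whose clicked URLs overlap with those of the current query. The sketch below assumes the log is a list of (query, clicked URL) pairs; the log entries and URLs are made up for illustration.

```python
from collections import defaultdict


def related_queries(log, query, top_k=3):
    """Rank past queries by the number of clicked URLs they share with `query`."""
    urls_by_query = defaultdict(set)
    for past_query, url in log:
        urls_by_query[past_query.lower()].add(url)

    current = query.lower()
    clicked = urls_by_query.get(current, set())
    scored = []
    for past_query, urls in urls_by_query.items():
        shared = len(urls & clicked)
        if past_query != current and shared:
            scored.append((past_query, shared))
    # Most shared clicks first; ties broken alphabetically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Hypothetical search log: (query, clicked URL).
log = [
    ("apple", "https://example.com/store"),
    ("apple", "https://example.com/phone"),
    ("iPhone", "https://example.com/phone"),
    ("iphone", "https://example.com/store"),
    ("banana bread", "https://example.org/recipes"),
]
suggestions = related_queries(log, "apple")
```

For the query "apple", the past query "iphone" is returned because users of both queries clicked the same pages. Those clicks act as the implicit relevance feedback mentioned above.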

Web Data

Use of anchor texts to generate features
  Anchor text: visible, clickable text of a hyperlink
  Most anchor texts are similar to real user queries
  Anchor texts typically describe contents of the document

Issues:
  “click here”
  One-word/short anchor texts

Use of Wikipedia documents and hyperlinks

http://uwplatt.edu/csse

Critical Issues

Parameter Setting

Efficiency

Usability

Parameter Setting

Rely on several parameters
  Number of pseudo-relevant documents
  Number of expansion terms
  Variables within term-ranking and weighting functions

Could use fixed values for key parameters
  Fixed values may not work well for all queries

Efficiency

Need to deliver real-time results to a large volume of users

Balancing performance time with quality results

Good AQE is computationally expensive
  Execution time of expanded query

Usability

Implementation hidden from users

Users may receive high-ranked documents that contain none of the query terms
  Query terms substituted entirely by synonyms
  Irrelevant document contains query terms in anchor text

Could increase user control
  Show user the features used
  Allow user to revise expanded query

AQE is better for non-expert users

Conclusions

No perfect solution for the vocabulary problem

Overcomes users’ reluctance and difficulty in providing better-refined queries to meet their needs

Variety of implementations

Efficiency is gradually increasing

Near the end of its experimental stage

Not yet ready to be implemented on large-scale IR systems such as web search engines

Questions?

References

Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.

Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. 1987. The vocabulary problem in human-system communication. Comm. ACM 30, 11, 964-971.

Mitra, M., Singhal, A., and Buckley, C. 1998. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 206-214.

Vechtomova, O. 2009. Query expansion for information retrieval. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu Eds., Springer, 2254-2257.