Introduction to Machine Learning for Information Retrieval
Xiaolong Wang
Dec 28, 2015
What is Machine Learning?
• In short, tricks of maths
• Two major tasks:
  – Supervised Learning:
    • a.k.a. regression, classification, …
  – Unsupervised Learning:
    • a.k.a. data manipulation, clustering, …
Supervised Learning
• Label Y: usually manually labeled
• Data X: a representation of the data, usually as a vector
• Prediction function f: selecting, from a predefined family of functions, the one that makes the best predictions

[Figure: examples of classification and regression]
• Two formulations:
  – F1: Given a set of (Xi, Yi), learn a function f such that f(Xi) ≈ Yi
    • Yi:
      – Binary: spam vs. non-spam
      – Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
    • Xi:
      – Number of words, occurrence of each word, …
    • f:
      – Usually a linear function, f(X) = wᵀX
• Two formulations:
  – F2: Given a set of (Xi, Yi), learn a function f: X → Y such that f(Xi) ≈ Yi
    • Yi: a more complex label than binary or numeric
      – Multiclass learning: entertainment vs. sports vs. politics …
      – Structural learning: syntactic parsing
  – F2 is more general than F1
• Training
  – Optimization:
    • Loss: difference between the true label Yi and the predicted label wᵀXi
      – Squared loss (regression): (Yi − wᵀXi)²
      – Hinge loss (classification): max(0, 1 − Yi·wᵀXi)
      – Logistic loss (classification): log(1 + exp(−Yi·wᵀXi))
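The three losses above can be sketched directly (function names are mine, not from the slides):

```python
import numpy as np

def squared_loss(y, score):
    # Regression: (Yi - w^T Xi)^2
    return (y - score) ** 2

def hinge_loss(y, score):
    # Classification, y in {-1, +1}: max(0, 1 - Yi * w^T Xi)
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Classification, y in {-1, +1}: log(1 + exp(-Yi * w^T Xi))
    return float(np.log1p(np.exp(-y * score)))
```

For example, a confidently correct prediction (y = +1, score = 2) has zero hinge loss, while a correct but low-margin one (score = 0.5) is still penalized.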
• Training
  – Optimization:
    • Regularization: penalize complex models, e.g. add λ‖w‖² to the loss
      – Without regularization: overfitting
      – With regularization: large margin, small ‖w‖
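One concrete way to see "small ‖w‖": ridge regression adds λ‖w‖² to the squared loss, which shrinks the learned weights (toy data; `ridge_fit` is an illustrative name, not from the slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimize ||y - Xw||^2 + lam * ||w||^2.
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=20)

w_unreg = ridge_fit(X, y, 0.0)    # no regularization
w_reg = ridge_fit(X, y, 10.0)     # regularized: smaller norm
```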
• Optimization: the art of minimization
  – Unconstrained:
    • First order: gradient descent
    • Second order: Newton's method
    • Stochastic: stochastic gradient descent (SGD)
  – Constrained:
    • Active set method
    • Interior point method
    • Alternating Direction Method of Multipliers (ADMM)
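A minimal SGD sketch for the logistic loss above, on toy separable data (all names and constants are illustrative):

```python
import numpy as np

def sgd_logistic(X, Y, lr=0.1, epochs=100, seed=0):
    # Stochastic gradient descent on sum_i log(1 + exp(-Yi * w^T Xi)):
    # visit examples in random order, one gradient step per example.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):
            margin = Y[i] * (w @ X[i])
            # gradient of the logistic loss: -Yi * Xi / (1 + exp(margin))
            w += lr * Y[i] * X[i] / (1.0 + np.exp(margin))
    return w

X = np.array([[2., 1.], [1., 2.], [-2., -1.], [-1., -2.]])
Y = np.array([1., 1., -1., -1.])
w = sgd_logistic(X, Y)   # separates the toy data
```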
Unsupervised Learning
• Clustering and dimensionality reduction:
  – PCA (dimensionality reduction)
  – k-means (clustering)
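A PCA sketch via SVD, the standard route: center the data, then project onto the top principal directions (toy data; `pca_project` is an illustrative name):

```python
import numpy as np

def pca_project(X, k):
    # Center the data, then project onto the top-k right singular
    # vectors (the principal directions).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Z = pca_project(X, 2)      # 100 points in 2 dimensions
```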
Machine Learning for Information Retrieval
• Learning to Rank
• Topic Modeling
Learning to Rank
http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
• X = (q, d)
  – Features: e.g. matching between query and document
• Labels:
  – Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1
  – Pairwise: doc A > doc B, doc C > doc D
  – Listwise: permutation
• Acquisition:
  – Expert annotation
  – Clickthrough: clicked document > skipped documents above it
• Prediction function:
  – Extract Xq,d from (q, d)
  – Rank documents by sorting wᵀXq,d
• Loss function:
  – Pointwise
  – Pairwise
  – Listwise
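The prediction step is just scoring and sorting (toy feature vectors, hypothetical weight values):

```python
import numpy as np

def rank_documents(w, features):
    # features[i] = X_{q, d_i}; return document indices sorted by
    # descending score w^T X.
    scores = features @ w
    return np.argsort(-scores)

w = np.array([1.0, 0.5])
feats = np.array([[0.2, 0.1],
                  [0.9, 0.3],
                  [0.5, 0.8]])
order = rank_documents(w, feats)   # best-scoring document first
```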
• Pointwise:
  – Regression: squared loss
• Pairwise:
  – Classification: (q, d1) > (q, d2) ⇒ positive example Xq,d1 − Xq,d2
• Listwise:
  – Optimization: NDCG@j
$$\mathrm{NDCG}@j = Z_j \sum_{i=1}^{j} \frac{2^{r_i} - 1}{\log(i+1)}$$

where r_i is the relevance (0/1) of the document at rank i, 1/log(i + 1) is the discount of rank i, the sum is the (discounted) cumulative gain, and the normalizer Z_j is chosen so that the ideal ranking scores 1.
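A direct implementation of the metric (here with the common 2^rel − 1 gain and log2 discount; the exact discount base varies across papers):

```python
import numpy as np

def dcg_at_j(rels, j):
    # Discounted cumulative gain: sum_i (2^{r_i} - 1) / log2(i + 1)
    r = np.asarray(rels[:j], dtype=float)
    ranks = np.arange(1, len(r) + 1)
    return float(np.sum((2.0 ** r - 1.0) / np.log2(ranks + 1)))

def ndcg_at_j(rels, j):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal = dcg_at_j(sorted(rels, reverse=True), j)
    return dcg_at_j(rels, j) / ideal if ideal > 0 else 0.0
```

A ranking that already places relevant documents first scores 1; pushing them down lowers the score.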
Topic Modeling
• Topic Modeling
  – Factorization of the words × documents matrix
• Clustering of documents
  – Project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
• What is a topic?
  – A linear combination of words
    • Nonnegative weights that sum to 1 ⇒ a probability distribution
• Generative models: story-telling
  – Latent Semantic Analysis (LSA)
  – Probabilistic Latent Semantic Analysis (PLSA)
  – Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA):
  – Deerwester et al. (1990)
  – Singular Value Decomposition (SVD) applied to the words × documents matrix
  – How to interpret negative values?
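A small LSA sketch: truncated SVD of a toy words × documents count matrix (counts are made up):

```python
import numpy as np

# Rows: words, columns: documents (toy counts).
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation
# U[:, :k]: words-to-topics mapping; Vt[:k]: topics-to-documents.
# Entries of U and Vt can be negative, which is the
# interpretability problem raised above.
```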
• Probabilistic Latent Semantic Analysis (PLSA):
  – Thomas Hofmann (1999)
  – How words/documents are generated (described by probabilities)
[Figure: (document, word) pairs — (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip), … — and the words × documents matrix factorized into a words × topics matrix times a topics × documents matrix.]

Maximum likelihood:

$$\max \sum_{d,w} n(d,w) \log \sum_{z} p(w \mid z)\, p(z \mid d)$$
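The PLSA likelihood above can be maximized with EM; a compact sketch (function name and toy counts are mine):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    # N: words x documents count matrix. Learn p(w|z) and p(z|d) by EM
    # on  sum_{w,d} N[w,d] * log sum_z p(w|z) p(z|d).
    rng = np.random.default_rng(seed)
    W, D = N.shape
    p_wz = rng.random((W, K)); p_wz /= p_wz.sum(axis=0)   # p(w|z)
    p_zd = rng.random((K, D)); p_zd /= p_zd.sum(axis=0)   # p(z|d)
    for _ in range(iters):
        # E-step: responsibilities p(z | w, d) proportional to p(w|z) p(z|d)
        q = p_wz[:, :, None] * p_zd[None, :, :]           # W x K x D
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by counts, renormalize
        wq = q * N[:, None, :]                            # expected counts
        p_wz = wq.sum(axis=2); p_wz /= p_wz.sum(axis=0)
        p_zd = wq.sum(axis=0); p_zd /= p_zd.sum(axis=0)
    return p_wz, p_zd

# Two toy "topics": words 0-1 occur in docs 0-1, words 2-3 in docs 2-3.
N = np.array([[4., 4., 0., 0.],
              [3., 5., 0., 0.],
              [0., 0., 5., 3.],
              [0., 0., 4., 4.]])
p_wz, p_zd = plsa_em(N, K=2)
```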
• Latent Dirichlet Allocation (LDA):
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior
• What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian: tossing a coin
  – Let r be the parameter to be estimated (probability of heads), g(r) the prior, and p(data | r) the likelihood; the posterior probability is

$$p(r \mid \text{data}) = \frac{p(\text{data} \mid r)\, g(r)}{p(\text{data})}$$

• Canonical maximum likelihood (frequentist) is a special case of Bayesian maximum a posteriori (MAP) when g(r) is the uniform prior
• Bayesian as an inference method:
  – Estimate r: posterior mean, or MAP
  – Estimate the probability that a new toss is heads: $p(\text{head} \mid \text{data}) = \int r\, p(r \mid \text{data})\, dr$, the posterior mean
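The coin example with a conjugate Beta prior makes these pieces concrete (numbers are illustrative):

```python
# Coin tossing: r = p(heads), prior g(r) = Beta(a, b).
# Conjugacy: the posterior after h heads, t tails is Beta(a + h, b + t).
a, b = 1.0, 1.0            # Beta(1, 1) is the uniform prior
h, t = 7, 3                # observed data

mle = h / (h + t)                              # frequentist estimate
map_est = (a + h - 1) / (a + b + h + t - 2)    # posterior mode (MAP)
post_mean = (a + h) / (a + b + h + t)          # posterior mean; also the
                                               # probability the next toss is heads
```

Under the uniform prior, MAP coincides with maximum likelihood (both 0.7 here), while the posterior mean (2/3) is pulled slightly toward the prior.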
• Latent Dirichlet Allocation (LDA), continued:
  – PLSA with a Dirichlet prior
• What additional info do we know about the parameters?
  – Sparsity:
    • each topic has nonzero probability on only a few words;
    • each document has nonzero probability on only a few topics;
  – The Dirichlet distribution defines a probability on the simplex
[Figure: words × documents matrix factorized into a words × topics matrix times a topics × documents matrix.]
• Parameters of a multinomial:
  – Nonnegative
  – Sum to 1 ⇒ a point on the simplex
• Dirichlet can encourage sparsity
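A quick numerical illustration of that claim: Dirichlet draws with concentration α < 1 put most of their mass on a few coordinates, while α > 1 spreads it out (dimensions and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Each draw is a point on the 10-dimensional simplex.
sparse_ish = rng.dirichlet([0.1] * 10, size=1000)   # alpha < 1
dense = rng.dirichlet([10.0] * 10, size=1000)       # alpha > 1

# Every sample is a valid multinomial parameter: nonnegative, sums to 1.
# With alpha < 1, the largest coordinate dominates on average.
peak_sparse = sparse_ish.max(axis=1).mean()
peak_dense = dense.max(axis=1).mean()
```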