Introduction to Machine Learning for Information Retrieval
Xiaolong Wang
Dec 28, 2015
What is Machine Learning?
• In short, tricks of maths
• Two major tasks:
  – Supervised Learning:
    • a.k.a. regression, classification, …
  – Unsupervised Learning:
    • a.k.a. data manipulation, clustering, …
Supervised Learning
• Label Y: usually manually labeled
• Data X: a representation of the data, usually as a vector
• Prediction function f: selecting, from a predefined family of functions, the one that makes the best predictions

[Figure: examples of classification and regression]
• Two formulations:
  – F1: Given a set of (Xi, Yi), learn a function f such that f(Xi) ≈ Yi
    • Yi:
      – Binary: spam vs. non-spam
      – Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
    • Xi:
      – Number of words, occurrence of each word, …
    • f:
      – Usually a linear function, f(X) = wᵀX
• Two formulations:
  – F2: Given a set of (Xi, Yi), learn a function f: X → Y such that f(Xi) ≈ Yi
    • Yi: a more complex label than binary or numeric
      – Multiclass learning: entertainment vs. sports vs. politics …
      – Structural learning: syntactic parsing
  – F2 is more general than F1
• Training
  – Optimization:
    • Loss: difference between the true label Yi and the predicted label wᵀXi
      – Squared loss (regression): (Yi − wᵀXi)²
      – Hinge loss (classification): max(0, 1 − Yi·wᵀXi)
      – Logistic loss (classification): log(1 + exp(−Yi·wᵀXi))
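The three losses above can be sketched directly (function names are mine, not from the slides):

```python
import numpy as np

def squared_loss(y, score):
    # Regression: (Yi - w^T Xi)^2
    return (y - score) ** 2

def hinge_loss(y, score):
    # Classification, y in {-1, +1}: max(0, 1 - Yi * w^T Xi)
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Classification, y in {-1, +1}: log(1 + exp(-Yi * w^T Xi))
    return float(np.log1p(np.exp(-y * score)))
```

For example, a confidently correct prediction (y = +1, score = 2) has zero hinge loss, while a correct but low-margin one (score = 0.5) is still penalized.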
• Training
  – Optimization:
    • Regularization: penalize complex models, e.g. add λ‖w‖² to the loss
      – Without regularization: overfitting
      – With regularization: large margin, small ‖w‖
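One concrete way to see "small ‖w‖": ridge regression adds λ‖w‖² to the squared loss, which shrinks the learned weights (toy data; `ridge_fit` is an illustrative name, not from the slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimize ||y - Xw||^2 + lam * ||w||^2.
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=20)

w_unreg = ridge_fit(X, y, 0.0)    # no regularization
w_reg = ridge_fit(X, y, 10.0)     # regularized: smaller norm
```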
• Optimization: the art of minimization
  – Unconstrained:
    • First order: gradient descent
    • Second order: Newton's method
    • Stochastic: stochastic gradient descent (SGD)
  – Constrained:
    • Active set method
    • Interior point method
    • Alternating Direction Method of Multipliers (ADMM)
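A minimal SGD sketch for the logistic loss above, on toy separable data (all names and constants are illustrative):

```python
import numpy as np

def sgd_logistic(X, Y, lr=0.1, epochs=100, seed=0):
    # Stochastic gradient descent on sum_i log(1 + exp(-Yi * w^T Xi)):
    # visit examples in random order, one gradient step per example.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):
            margin = Y[i] * (w @ X[i])
            # gradient of the logistic loss: -Yi * Xi / (1 + exp(margin))
            w += lr * Y[i] * X[i] / (1.0 + np.exp(margin))
    return w

X = np.array([[2., 1.], [1., 2.], [-2., -1.], [-1., -2.]])
Y = np.array([1., 1., -1., -1.])
w = sgd_logistic(X, Y)   # separates the toy data
```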
Unsupervised Learning
• Clustering and dimensionality reduction:
  – PCA (dimensionality reduction)
  – k-means (clustering)
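A PCA sketch via SVD, the standard route: center the data, then project onto the top principal directions (toy data; `pca_project` is an illustrative name):

```python
import numpy as np

def pca_project(X, k):
    # Center the data, then project onto the top-k right singular
    # vectors (the principal directions).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Z = pca_project(X, 2)      # 100 points in 2 dimensions
```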
Machine Learning for Information Retrieval
• Learning to Rank
• Topic Modeling
Learning to Rank
http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
• X = (q, d)
  – Features: e.g. matching between query and document
• Labels:
  – Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1
  – Pairwise: doc A > doc B, doc C > doc D
  – Listwise: permutation
• Acquisition:
  – Expert annotation
  – Clickthrough: clicked document > skipped documents above it
• Prediction function:
  – Extract Xq,d from (q, d)
  – Rank documents by sorting wᵀXq,d
• Loss function:
  – Pointwise
  – Pairwise
  – Listwise
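The prediction step is just scoring and sorting (toy feature vectors, hypothetical weight values):

```python
import numpy as np

def rank_documents(w, features):
    # features[i] = X_{q, d_i}; return document indices sorted by
    # descending score w^T X.
    scores = features @ w
    return np.argsort(-scores)

w = np.array([1.0, 0.5])
feats = np.array([[0.2, 0.1],
                  [0.9, 0.3],
                  [0.5, 0.8]])
order = rank_documents(w, feats)   # best-scoring document first
```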
• Pointwise:
  – Regression: squared loss
• Pairwise:
  – Classification: (q, d1) > (q, d2) ⇒ positive example Xq,d1 − Xq,d2
• Listwise:
  – Optimization: NDCG@j
$$\mathrm{NDCG}@j = Z_j \sum_{i=1}^{j} \frac{2^{r_i} - 1}{\log(i+1)}$$

where r_i is the relevance (0/1) of the document at rank i, 1/log(i + 1) is the discount of rank i, the sum is the (discounted) cumulative gain, and the normalizer Z_j is chosen so that the ideal ranking scores 1.
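A direct implementation of the metric (here with the common 2^rel − 1 gain and log2 discount; the exact discount base varies across papers):

```python
import numpy as np

def dcg_at_j(rels, j):
    # Discounted cumulative gain: sum_i (2^{r_i} - 1) / log2(i + 1)
    r = np.asarray(rels[:j], dtype=float)
    ranks = np.arange(1, len(r) + 1)
    return float(np.sum((2.0 ** r - 1.0) / np.log2(ranks + 1)))

def ndcg_at_j(rels, j):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal = dcg_at_j(sorted(rels, reverse=True), j)
    return dcg_at_j(rels, j) / ideal if ideal > 0 else 0.0
```

A ranking that already places relevant documents first scores 1; pushing them down lowers the score.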
Topic Modeling
• Topic Modeling
  – Factorization of the words × documents matrix
• Clustering of documents
  – Project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
• What is a topic?
  – A linear combination of words
    • Nonnegative weights that sum to 1 ⇒ a probability distribution
• Generative models: story-telling
  – Latent Semantic Analysis (LSA)
  – Probabilistic Latent Semantic Analysis (PLSA)
  – Latent Dirichlet Allocation (LDA)
• Latent Semantic Analysis (LSA):
  – Deerwester et al. (1990)
  – Singular Value Decomposition (SVD) applied to the words × documents matrix
  – How to interpret negative values?
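A small LSA sketch: truncated SVD of a toy words × documents count matrix (counts are made up):

```python
import numpy as np

# Rows: words, columns: documents (toy counts).
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation
# U[:, :k]: words-to-topics mapping; Vt[:k]: topics-to-documents.
# Entries of U and Vt can be negative, which is the
# interpretability problem raised above.
```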
• Probabilistic Latent Semantic Analysis (PLSA):
  – Thomas Hofmann (1999)
  – How words/documents are generated (described by probabilities)
[Figure: (document, word) pairs — (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip), … — and the words × documents matrix factorized into a words × topics matrix times a topics × documents matrix.]

Maximum likelihood:

$$\max \sum_{d,w} n(d,w) \log \sum_{z} p(w \mid z)\, p(z \mid d)$$
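The PLSA likelihood above can be maximized with EM; a compact sketch (function name and toy counts are mine):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    # N: words x documents count matrix. Learn p(w|z) and p(z|d) by EM
    # on  sum_{w,d} N[w,d] * log sum_z p(w|z) p(z|d).
    rng = np.random.default_rng(seed)
    W, D = N.shape
    p_wz = rng.random((W, K)); p_wz /= p_wz.sum(axis=0)   # p(w|z)
    p_zd = rng.random((K, D)); p_zd /= p_zd.sum(axis=0)   # p(z|d)
    for _ in range(iters):
        # E-step: responsibilities p(z | w, d) proportional to p(w|z) p(z|d)
        q = p_wz[:, :, None] * p_zd[None, :, :]           # W x K x D
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by counts, renormalize
        wq = q * N[:, None, :]                            # expected counts
        p_wz = wq.sum(axis=2); p_wz /= p_wz.sum(axis=0)
        p_zd = wq.sum(axis=0); p_zd /= p_zd.sum(axis=0)
    return p_wz, p_zd

# Two toy "topics": words 0-1 occur in docs 0-1, words 2-3 in docs 2-3.
N = np.array([[4., 4., 0., 0.],
              [3., 5., 0., 0.],
              [0., 0., 5., 3.],
              [0., 0., 4., 4.]])
p_wz, p_zd = plsa_em(N, K=2)
```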
• Latent Dirichlet Allocation (LDA):
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior
• What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian: tossing a coin
  – Let r be the parameter to be estimated (probability of heads), g(r) the prior, and p(data | r) the likelihood; the posterior probability is

$$p(r \mid \text{data}) = \frac{p(\text{data} \mid r)\, g(r)}{p(\text{data})}$$

• Canonical maximum likelihood (frequentist) is a special case of Bayesian maximum a posteriori (MAP) when g(r) is the uniform prior
• Bayesian as an inference method:
  – Estimate r: posterior mean, or MAP
  – Estimate the probability that a new toss is heads: $p(\text{head} \mid \text{data}) = \int r\, p(r \mid \text{data})\, dr$, the posterior mean
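The coin example with a conjugate Beta prior makes these pieces concrete (numbers are illustrative):

```python
# Coin tossing: r = p(heads), prior g(r) = Beta(a, b).
# Conjugacy: the posterior after h heads, t tails is Beta(a + h, b + t).
a, b = 1.0, 1.0            # Beta(1, 1) is the uniform prior
h, t = 7, 3                # observed data

mle = h / (h + t)                              # frequentist estimate
map_est = (a + h - 1) / (a + b + h + t - 2)    # posterior mode (MAP)
post_mean = (a + h) / (a + b + h + t)          # posterior mean; also the
                                               # probability the next toss is heads
```

Under the uniform prior, MAP coincides with maximum likelihood (both 0.7 here), while the posterior mean (2/3) is pulled slightly toward the prior.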
• Latent Dirichlet Allocation (LDA), continued:
  – PLSA with a Dirichlet prior
• What additional info do we know about the parameters?
  – Sparsity:
    • each topic has nonzero probability on only a few words;
    • each document has nonzero probability on only a few topics;
  – The Dirichlet distribution defines a probability on the simplex
[Figure: words × documents matrix factorized into a words × topics matrix times a topics × documents matrix.]
• Parameters of a multinomial:
  – Nonnegative
  – Sum to 1 ⇒ a point on the simplex
• Dirichlet can encourage sparsity
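A quick numerical illustration of that claim: Dirichlet draws with concentration α < 1 put most of their mass on a few coordinates, while α > 1 spreads it out (dimensions and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Each draw is a point on the 10-dimensional simplex.
sparse_ish = rng.dirichlet([0.1] * 10, size=1000)   # alpha < 1
dense = rng.dirichlet([10.0] * 10, size=1000)       # alpha > 1

# Every sample is a valid multinomial parameter: nonnegative, sums to 1.
# With alpha < 1, the largest coordinate dominates on average.
peak_sparse = sparse_ish.max(axis=1).mean()
peak_dense = dense.max(axis=1).mean()
```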