Latent Semantic Indexing Sudarsun. S., M.Tech Checktronix India Pvt Ltd, Chennai 600034 [email protected]

Latent Semantic Indexing For Information Retrieval

May 10, 2015

Introducing Latent Semantic Analysis through Singular Value Decomposition on Text Data for Information Retrieval
Page 1: Latent Semantic Indexing For Information Retrieval

Latent Semantic Indexing

Sudarsun. S., M.Tech, Checktronix India Pvt Ltd, Chennai 600034 [email protected]

Sudarsun - Checktronix R & D 2

What is NLP ?

What is Natural Language?
Can a machine understand NL?
How do we understand NL?
How can we make a machine understand NL?
What are the limitations?

Major Entities …

What is Syntactic Analysis? How to deal with synonymy? How to deal with polysemy?

What is Semantics ? Represent meanings as a Semantic Net

What is Knowledge ? How to represent knowledge ?

What are Inferences and Reasoning ? How to use the accumulated knowledge ?

LSA for Information Retrieval

What is LSA?
Singular Value Decomposition
Method of LSA
Computation of Similarity using Cosine
Measuring Similarities
Construction of Pseudo-document
Limitations of LSA
Alternatives to LSA

What is LSA?

A Statistical Method that provides a way to describe the underlying structure of texts

Used in author recognition, search engines, detecting plagiarism, and comparing texts for similarities

The contexts in which a certain word exists or does not exist determine the similarity of the documents

Closely models human learning, especially the manner in which people learn a language and acquire a vocabulary

Singular Value Decomposition

A multivariate data reduction technique.

Reduces a large dataset to a concentrated dataset containing only the significant information from the original data.

Mathematical Background of SVD

SVD decomposes a matrix into a product of 3 matrices.

Let A be an M x N matrix; then the SVD of A is

$$A = U_{M \times K} \, S_{K \times K} \, V^{T}_{K \times N}$$

U, V: left and right singular matrices, respectively.

U and V have orthonormal columns: the columns are mutually orthogonal and of unit length.

S: diagonal matrix whose diagonal elements are the singular values, arranged in descending order.

K: rank of A; K <= min(M, N).

Computation of SVD

To find the U, S and V matrices:

Find the eigenvalues and their corresponding eigenvectors of the matrix AA^T.

Singular values = square roots of the non-zero eigenvalues.

These singular values, arranged in descending order, form the diagonal elements of the diagonal matrix S.

Divide each eigenvector by its length.

These eigenvectors form the columns of the matrix U.

Similarly, the eigenvectors of the matrix A^TA form the columns of the matrix V.

[Note: The non-zero eigenvalues of AA^T and A^TA are equal.]
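The procedure on this slide can be sketched in Python with NumPy (an assumption; the deck itself only mentions Octave/MATLAB). The matrix values are illustrative. One deliberate deviation: instead of taking V from a separate eigen-decomposition of A^TA, the sketch derives each column of V from the matching column of U, because independently computed eigenvectors can differ in sign and then U S V^T would not reproduce A.

```python
import numpy as np

# Small example matrix (illustrative values, not from the slides).
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Eigen-decomposition of the symmetric matrix A A^T.
# eigh returns eigenvalues in ascending order with unit-length eigenvectors.
eigvals, eigvecs = np.linalg.eigh(A @ A.T)

# Reorder to descending, matching the ordering of the diagonal of S.
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], eigvecs[:, order]

# Singular values are the square roots of the (non-zero) eigenvalues.
singular_values = np.sqrt(eigvals)

# Column i of V is A^T u_i / sigma_i, which keeps its sign consistent with U.
V = (A.T @ U) / singular_values

# Rebuild A from the three factors to check the decomposition.
A_rebuilt = U @ np.diag(singular_values) @ V.T
```

For this matrix AA^T has eigenvalues 12 and 10, so the singular values are their square roots, and `A_rebuilt` matches `A` up to rounding.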

Eigenvalues & Eigenvectors

A scalar λ is called an eigenvalue of a matrix A if there is a non-zero vector v such that Av = λv. This non-zero vector is an eigenvector of A.

Eigenvalues can be found by solving the equation |A – λI| = 0.
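As a small check of this definition, the sketch below uses NumPy (an assumption, not something the slides prescribe) on a 2 x 2 matrix chosen for illustration. It verifies that each eigenvalue and eigenvector pair satisfies the defining equation.

```python
import numpy as np

# Illustrative symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# By hand: det(A - lambda*I) = (2 - lambda)^2 - 1 = 0, so lambda is 1 or 3.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Column i of `eigenvectors` is a unit-length eigenvector for eigenvalues[i].
# The residual |A v - lambda v| should be zero up to rounding for every pair.
residuals = [
    np.linalg.norm(A @ eigenvectors[:, i] - eigenvalues[i] * eigenvectors[:, i])
    for i in range(len(eigenvalues))
]
```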

How to Build LSA ?

Preprocess the document collection: stemming, stop-word removal
Build Frequency Matrix
Apply Pre-weights
Decompose FM into U, S, V
Project Queries

Frequency Matrix

Step #1: Construct the term-document matrix (TDM)
One column for each document
One row for every word
The value of cell (i, j) is the frequency of word i in document j
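A minimal sketch of building such a frequency matrix in plain Python. The three documents are hypothetical and assumed to be already stemmed and stripped of stop words; sorting the vocabulary is just one way to fix the row order.

```python
from collections import Counter

# Toy, already-preprocessed documents (hypothetical data for illustration).
docs = [
    "human machine interface",
    "machine learning system",
    "graph tree graph",
]

# One row per word: the sorted vocabulary fixes the row order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One Counter per document holds that document's word frequencies.
counts = [Counter(doc.split()) for doc in docs]

# tdm[i][j] is the frequency of word i in document j.
tdm = [[c[word] for c in counts] for word in vocab]
```

Here the row for "graph" is `[0, 0, 2]` and the row for "machine" is `[1, 1, 0]`.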

Step #2: Weight Functions

Improve the effectiveness of information retrieval.

Allocate weights to the terms based on their occurrences.

Each element is replaced with the product of a Local Weight Function (LWF) and a Global Weight Function (GWF).

LWF considers the frequency of a word within a particular text.

GWF examines a term’s frequency across all the documents.

Pre-weightings: applied to the TDM before computing the SVD.

Post-weightings: applied to the terms of a query when it is projected for matching or searching.
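The slides do not say which weight functions are used, so the sketch below picks one common pairing purely as an assumption: a logarithmic local weight, log(1 + tf), and an inverse-document-frequency global weight, log(N / df). The matrix values are illustrative and NumPy is assumed.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
tdm = np.array([[2.0, 0.0, 1.0],
                [1.0, 1.0, 1.0],
                [0.0, 3.0, 0.0]])

n_docs = tdm.shape[1]

# Local weight (assumed choice): dampen raw counts with log(1 + tf).
local = np.log1p(tdm)

# Global weight (assumed choice): log(N / df), one value per term.
# Every term here occurs in at least one document, so df is never zero.
doc_freq = np.count_nonzero(tdm, axis=1)
global_w = np.log(n_docs / doc_freq)

# Each cell becomes LWF * GWF; the per-term global weight is broadcast
# across the columns of its row.
weighted = local * global_w[:, np.newaxis]
```

With this choice, a term that occurs in every document (the second row) gets a global weight of zero and so contributes nothing to the weighted matrix.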

Step #3: SVD

Perform SVD on the term-document matrix A.

Truncating the SVD to the largest singular values removes noise, such as infrequent words that do not help to classify a document.

Octave/MATLAB can be used:

[u, s, v] = svd(A);
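For readers working in Python, a roughly equivalent sketch with NumPy follows (an assumption; the deck uses Octave/MATLAB). It also shows the truncation to k dimensions that is commonly applied in LSA; the value k = 2 and the matrix entries are hypothetical.

```python
import numpy as np

# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])

# Economy-size SVD. Unlike Octave's svd, NumPy returns the singular
# values as a 1-D array and V already transposed (Vt).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep (hypothetical choice)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-k approximation of A built from the k largest singular values.
A_k = U_k @ S_k @ Vt_k
```

The Frobenius norm of `A - A_k` equals the square root of the sum of the squared discarded singular values, which gives a quick way to judge how much information a given k throws away.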

[Figure: A (m x n, terms x documents) = U (m x k) · S (k x k, diagonal, zeros off the diagonal) · Vt (k x n)]

[Figure: the terms x documents TDM is decomposed by SVD into U, S and V]

Similarity Computation Using Cosine

Consider 2 vectors A and B. The similarity between these 2 vectors is

$$\cos\theta = \frac{A \cdot B}{|A| \, |B|}$$

cos θ ranges from –1 to +1.
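The formula can be sketched as a small helper in Python with NumPy (an assumption; the function name `cosine` is chosen here for illustration). The three calls cover the two ends of the range and the midpoint.

```python
import numpy as np

def cosine(a, b):
    """Return the cosine of the angle between vectors a and b.

    Undefined (division by zero) if either vector has zero length.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine([1, 2, 3], [2, 4, 6])      # parallel vectors
orthogonal = cosine([1, 0], [0, 1])      # perpendicular vectors
opposite = cosine([1, 2], [-1, -2])      # opposite directions
```

Parallel vectors give +1, perpendicular vectors give 0, and opposite vectors give -1, regardless of vector length.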

Similarity Computations in LSA

Term – Term Similarity

Compute the cosine for the row vectors of term ‘i’ and term ‘j’ in the U*S matrix.

Document – Document Similarity

Compute the cosine for the column vectors of document ‘i’ and document ‘j’ in the S*Vt matrix.

Term – Document Similarity

Compute the cosine between the row vector of term ‘i’ in the U*S^(1/2) matrix and the column vector of document ‘j’ in the S^(1/2)*Vt matrix.
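The three similarity computations (term-term, document-document, term-document) can be sketched together in Python with NumPy. Everything concrete here is an assumption made for illustration: the toy matrix, the choice of k = 2 kept dimensions, and the helper name `cosine`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # hypothetical number of dimensions kept
U, S, Vt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
S_half = np.sqrt(S)  # element-wise root; off-diagonal zeros stay zero

term_space = U @ S   # row i is the vector of term i
doc_space = S @ Vt   # column j is the vector of document j

# Term-term: rows of U*S.
term_term = cosine(term_space[0], term_space[1])

# Document-document: columns of S*Vt.
doc_doc = cosine(doc_space[:, 0], doc_space[:, 1])

# Term-document: row of U*S^(1/2) against column of S^(1/2)*Vt.
term_doc = cosine((U @ S_half)[0], (S_half @ Vt)[:, 2])
```

Terms 0 and 1 occur in the same documents in this toy matrix, so their term-term cosine in the reduced space comes out close to 1.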

[Figure: term vectors are the rows of U*S^(1/2); document vectors are the columns of S^(1/2)*Vt]

Construction of Pseudo-document

A query is broken into terms and represented as a column vector (say ‘q’) consisting of ‘M’ terms as rows.

Then the pseudo-document (Q) for the query (q) can be constructed with the following formula:

$$Q = q^{T} U S^{-1}$$

After constructing the pseudo-document, we can compute query-term and query-document similarities using the techniques described earlier.
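A minimal sketch of this projection in Python with NumPy. The toy matrix, the query and k = 2 are assumptions. The scoring step is one reading of the slide: it scales the pseudo-document by S so that it can be compared with the columns of S*Vt, as in the document-document case.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # hypothetical number of dimensions kept
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Query over the same M terms: terms 0 and 1 occur once each.
q = np.array([1.0, 1.0, 0.0, 0.0])

# Q = q^T U S^-1; dividing by s_k applies the inverse of the diagonal S.
Q = (q @ U_k) / s_k

# Sanity check: projecting an existing document column the same way
# reproduces that document's coordinates in V.
doc0 = (A[:, 0] @ U_k) / s_k

# Query-document similarity: compare Q*S with each column of S*Vt.
scores = [cosine(Q * s_k, s_k * Vt_k[:, j]) for j in range(A.shape[1])]
```

Documents 0 and 1 contain the query terms and document 2 does not, so the first two scores come out higher than the third.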

Alternatives to LSA

LSA mainly addresses the synonymy problem; it does not handle polysemy.

PLSA – Probabilistic Latent Semantic Analysis to handle Polysemy.

LDA – Latent Dirichlet Allocation.

Thanks..

You may send in your queries to [email protected]