MAT 167: Applied Linear Algebra
Lecture 22: Text Mining

Naoki Saito

Department of Mathematics, University of California, Davis

May 19 & 22, 2017

Outline

1 Introduction

2 Preprocessing the Documents and Queries

3 The Vector Space Model

4 Latent Semantic Indexing

Introduction

What Is Text Mining?

Text mining = Methods for extracting useful information from large and often unstructured collections of texts.
It is also closely related to “information retrieval.”
In this context, keywords that carry information about the contents of a document are called terms.
A list of all the terms in a document is called an index.
For each term, a list of all the documents that contain that particular term is called an inverted index.
A typical application is to search databases of scientific papers for given query terms.

Because of Lecture 2 and HW #1, you should already be familiar with the concept of a term-document matrix.
Each column represents a document while each row represents a term.
The ij-th entry of such a matrix normally represents the frequency of occurrence of term i in document j.
In reality, such matrices are huge (≃ 10^5 × 10^6).
Fortunately, most of the time they are quite sparse.
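As a concrete illustration (not part of the original slides), here is a minimal Python sketch that builds such a count matrix for a toy corpus; the three documents and the naive whitespace tokenization are made up for the example.

    import numpy as np

    docs = ["sparse matrix algorithms", "web search ranking", "ranking web pages"]
    # vocabulary = sorted list of the unique terms over all documents
    terms = sorted({w for d in docs for w in d.split()})
    # A[i, j] = number of times term i occurs in document j
    A = np.zeros((len(terms), len(docs)), dtype=int)
    for j, d in enumerate(docs):
        for w in d.split():
            A[terms.index(w), j] += 1
    print(terms)
    print(A)

For a realistic collection (such as the NIPS matrix below), one would store A in a sparse format, e.g., scipy.sparse, since almost all entries are zero.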

The NIPS Dataset

In this lecture, we will use the ‘Bag of Words’ dataset available from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
This is a collection of 1500 (= n) articles (mostly in the fields of machine learning and computational neuroscience) published in the proceedings of the Conference on Neural Information Processing Systems (NIPS) over certain periods.
The total number of terms (words) examined for these articles is 12419 (= m).
More precisely, after tokenization (i.e., breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens) and removal of stop words (i.e., common words that do not give useful information; more about these in the next section), the vocabulary of unique words was truncated by keeping only words that occurred more than ten times.
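The following Python sketch shows one way this dataset could be loaded into a sparse term-document matrix. It assumes the UCI ‘Bag of Words’ layout (three header lines D, W, NNZ in docword.nips.txt, then "docID wordID count" triples with 1-based indices, and one word per line in vocab.nips.txt), so check the actual files before relying on it.

    import numpy as np
    from scipy.sparse import csc_matrix

    with open("vocab.nips.txt") as f:
        vocab = [line.strip() for line in f]      # the m terms, one per line

    rows, cols, vals = [], [], []
    with open("docword.nips.txt") as f:
        n_docs = int(f.readline())                # D: number of documents
        n_terms = int(f.readline())               # W: size of the vocabulary
        f.readline()                              # NNZ: number of nonzero counts (unused here)
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            rows.append(word_id - 1)              # convert 1-based IDs to 0-based indices
            cols.append(doc_id - 1)
            vals.append(count)

    # m x n term-document matrix: A[i, j] = frequency of term i in document j
    A = csc_matrix((vals, (rows, cols)), shape=(n_terms, n_docs))
    print(A.shape, A.nnz)   # if the format assumption holds: (12419, 1500) and 746316 nonzeros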

First 10 words sorted in alphabetical order: ‘a2i’, ‘aaa’, ‘aaai’, ‘aapo’, ‘aat’, ‘aazhang’, ‘abandonment’, ‘abbott’, ‘abbreviated’, ‘abcde’.
10 most frequently used words: ‘network’, ‘model’, ‘learning’, ‘function’, ‘input’, ‘neural’, ‘set’, ‘algorithm’, ‘system’, ‘data’.

Preprocessing the Documents and Queries

Before the index (a list of terms contained in a given document) is made, we need to do the following two preprocessing steps:

1 Elimination of stop words
2 Stemming

Stop words are words that can be found in virtually any document (i.e., most likely useless words for characterizing the documents), e.g., ‘a’, ‘able’, ‘about’, ‘above’, ‘according’, ‘accordingly’, ‘across’, ‘actually’, ‘after’, ...
Stemming is the process of reducing each word that is conjugated or has a suffix to its stem. For example, ‘fishing’, ‘fished’, ‘fish’, ‘fisher’ all stem to ‘fish’ (the root word).
There are some public-domain stemming software systems; see the ‘Stemming’ page on Wikipedia.
Note that stemming was not performed on the NIPS dataset, e.g., the terms include ‘model’, ‘modeled’, ‘modeling’, ‘modelled’, ‘modelling’.
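As an illustration (not from the slides), here is a minimal Python sketch of both steps. The tiny stop-word list is made up for the example, and the Porter stemmer from the external NLTK package is just one publicly available choice.

    from nltk.stem import PorterStemmer      # external dependency: pip install nltk

    stop_words = {"a", "about", "the", "and", "of", "to", "in"}   # toy list for illustration
    stemmer = PorterStemmer()

    def preprocess(text):
        """Lower-case, drop stop words, and stem the remaining tokens."""
        tokens = text.lower().split()        # naive whitespace tokenization
        return [stemmer.stem(w) for w in tokens if w not in stop_words]

    print(preprocess("Fishing and fished about the fisher"))
    # -> ['fish', 'fish', 'fisher']; note the Porter stemmer keeps 'fisher' as is,
    #    unlike the idealized example on the slide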

The Vector Space Model

The main idea of this model is to create a term-document matrix, say A = (a_ij) ∈ ℝ^(m×n), where each document is represented by a column vector a_j that has nonzero entries in the positions that correspond to the terms found in that document.

Consequently, each row represents a term and has nonzero entries in those positions that correspond to the documents where that term can be found, i.e., the inverted index.

In practice, a text parser (a program) is used to create term-document matrices; it also performs stemming and stop-word removal.

The entry a_ij is normally set to the term frequency f_ij, i.e., the number of times term i appears in document j.

One can also use weights, e.g., a_ij = f_ij log(n/n_i), where n_i is the number of documents that contain term i. If term i occurs frequently in only a few documents, then the log factor becomes significant. On the other hand, if term i occurs in many documents, the log factor makes a_ij ≈ 0, i.e., term i is not useful. Stop-word removal mitigates this to some extent.
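A short Python sketch of this weighting (not from the slides), applied to a dense term-document count matrix F; the function and variable names are mine. Terms that occur in every document get weight zero.

    import numpy as np

    def tfidf_weight(F):
        """Return a_ij = f_ij * log(n / n_i) for an m x n term-document count matrix F."""
        F = np.asarray(F, dtype=float)
        n = F.shape[1]                        # number of documents
        n_i = np.count_nonzero(F, axis=1)     # number of documents containing each term
        idf = np.log(n / np.maximum(n_i, 1))  # guard against terms that occur nowhere
        return F * idf[:, None]

    F = np.array([[2, 0, 1],                  # term present in 2 of 3 documents
                  [1, 1, 1]])                 # term present in every document -> weight 0
    print(tfidf_weight(F))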

Usually, the term-document matrix is sparse. For example, in the NIPS dataset, the number of nonzero entries in the term-document matrix of size 12419 × 1500 is 746,316, which is only 4% of the entries of the whole matrix.

Figure: The first 1000 rows of the NIPS term-document matrix. Each dot represents a nonzero entry.

Query Matching

Query matching = the process of finding the relevant documents for a given query vector q ∈ ℝ^m.
We must define a distance or similarity between q and each document a_j ∈ ℝ^m, j = 1 : n.
Often the following cosine distance (in fact, it would be better to say similarity rather than distance) is used:

cos(θ(q, a_j)) = q^T a_j / (‖q‖_2 ‖a_j‖_2).

If θ(q, a_j) is small enough, then a_j is deemed relevant.
More precisely, we set some predefined tolerance tol, and if cos(θ(q, a_j)) > tol, then a_j is deemed relevant.
The smaller the value of tol, the more documents are retrieved and considered relevant, even if many of them are not really relevant.
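A hedged Python sketch of this thresholding (not from the slides) against the columns of a term-document matrix A; the function name, and the assumption that no column of A is all zeros, are mine.

    import numpy as np

    def query_match(A, q, tol=0.1):
        """Return the indices j with cos(theta(q, a_j)) > tol and the full score vector."""
        A = np.asarray(A, dtype=float)        # columns a_j are the documents
        q = np.asarray(q, dtype=float)
        cos = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
        return np.flatnonzero(cos > tol), cos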

A Query Matching Example

Let’s consider the NIPS dataset and set up q ∈ ℝ^12419 as q = e_3528 + e_6700 + e_6932, i.e., q has only three nonzero entries, which correspond to the three terms ‘entropy’, ‘minimum’, ‘maximum’.
Compute cos(θ(q, a_j)), j = 1 : 1500.

Figure: tol=0.2, 0.1, 0.05 correspond to 4, 15, 89 returned documents.
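A sketch of how such a query vector could be formed from the vocabulary list (assuming the vocab list loaded in the earlier snippet and Python's 0-based indexing, as opposed to the 1-based indices e_3528, e_6700, e_6932 on the slide).

    import numpy as np

    def query_vector(vocab, words):
        """Unit entries at the positions of the given query terms, zeros elsewhere."""
        q = np.zeros(len(vocab))
        for w in words:
            q[vocab.index(w)] = 1.0          # raises ValueError if w is not in the vocabulary
        return q

    # q = query_vector(vocab, ["entropy", "minimum", "maximum"])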

Performance Modeling

Let us define the following quantities:

Precision: P := D_r / D_t;    Recall: R := D_r / N_r,

where D_r, D_t, N_r are the number of relevant documents retrieved, the total number of documents retrieved, and the total number of relevant documents in the database, respectively.
If we set tol large in the cosine similarity measure, then we expect to have high P but low R.
On the other hand, if we set tol small, the situation is the other way around.
Unfortunately, in the NIPS dataset there is no information on the documents except the terms used in them. Hence, we cannot really compute “the Recall vs Precision plot” like those in the textbook.
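A trivial Python sketch of these two quantities (not from the slides), phrased in terms of sets of document indices.

    def precision_recall(retrieved, relevant):
        """P = |retrieved ∩ relevant| / |retrieved|, R = |retrieved ∩ relevant| / |relevant|."""
        retrieved, relevant = set(retrieved), set(relevant)
        d_r = len(retrieved & relevant)       # relevant documents that were actually retrieved
        return d_r / len(retrieved), d_r / len(relevant)

    print(precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5}))   # (0.5, 0.666...)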

Latent Semantic Indexing

Latent Semantic Indexing (LSI)

LSI is an indexing and retrieval method that uses the SVD to identify patterns in the relationships between the terms and documents.
It is based on the principle that words that are used in the same contexts tend to have similar meanings.
A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
Its history can be traced back to factor analysis applications in the mid 1960s, but it started gaining popularity in the late ’80s to early ’90s.
Nowadays, LSI is used in many applications on a daily basis.

Let A ∈ ℝ^(m×n) be a term-document matrix, and let A_k := U_k Σ_k V_k^T be the rank-k approximation of A using the first k singular values and singular vectors. Let H_k := Σ_k V_k^T, i.e., A_k = U_k H_k.
For an appropriate value of k, A ≈ A_k. Hence, we have a_j ≈ U_k h_j, where a_j and h_j are the jth column vectors of A and H_k, respectively.
This means that h_j contains the expansion coefficients of the best k-term approximation to a_j w.r.t. the ONB vectors {u_1, ..., u_k}.
Previously, for a given query vector q, in order to compute the cosine similarities between q and a_j, j = 1 : n, we had to compute q^T A followed by normalization by ‖q‖_2 and ‖a_j‖_2.
Now, let’s replace A by its best k-term approximation A_k, i.e., we compute: q^T A_k = q^T U_k H_k = (U_k^T q)^T H_k.
Hence, we can simplify the cosine similarity computation as follows:

cos θ_j := q_k^T h_j / (‖q‖_2 ‖h_j‖_2),    where q_k := U_k^T q.
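A hedged Python sketch of this reduced computation (not from the slides), using scipy's truncated sparse SVD; the function and variable names are mine, and A is assumed to be the sparse term-document matrix built earlier.

    import numpy as np
    from scipy.sparse.linalg import svds

    def lsi_similarities(A, q, k=100):
        """Scores cos(theta_j) = q_k^T h_j / (||q||_2 ||h_j||_2), with A_k = U_k H_k."""
        U, s, Vt = svds(A.astype(float), k=k)    # the k largest singular triplets of A
        H = s[:, None] * Vt                      # H_k = Sigma_k V_k^T   (k x n)
        qk = U.T @ q                             # q_k = U_k^T q         (k,)
        return (qk @ H) / (np.linalg.norm(q) * np.linalg.norm(H, axis=0))

Only the small matrix H_k and the k-vector q_k are needed here; A_k itself is never formed, which is the point made on the next slide.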

Note that there is a typo in the textbook Eqn. (11.4). The formula on the previous slide is correct. In the textbook formula, the author normalized it by ‖q_k‖_2 instead of ‖q‖_2. You can show that ‖q_k‖_2 ≠ ‖q‖_2.
The reason why we formed H_k and q_k is that there is no need to explicitly compute and store A_k once we have H_k and q_k. Directly dealing with A_k by computing and storing it is wasteful and time-consuming, particularly for a large A.

An LSI Query Example

Let’s use the NIPS dataset with k = 100.
Then, the relative error between A_100 and A in terms of the Frobenius norm, i.e., ‖A − A_100‖_F / ‖A‖_F, was 0.6074, which is still large.
Nonetheless, we get relatively good performance.

Figure: With the best 100-term approximation, tol=0.2, 0.1, 0.05 correspond to 0, 4, 72 returned documents; compare with the no-approximation case: 4, 15, 89.
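A sketch (under the same assumptions as the earlier snippets) of how this relative Frobenius error can be computed without ever forming A_100, using the identity ‖A − A_k‖_F² = ‖A‖_F² − Σ_{i ≤ k} σ_i².

    import numpy as np
    from scipy.sparse.linalg import svds

    def relative_frobenius_error(A, k=100):
        """||A - A_k||_F / ||A||_F from the k largest singular values of A."""
        A = A.astype(float)
        _, s, _ = svds(A, k=k)                                   # k largest singular values
        sq = A.multiply(A) if hasattr(A, "multiply") else np.square(A)
        total = sq.sum()                                         # ||A||_F^2
        return float(np.sqrt(max(total - np.sum(s**2), 0.0) / total))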

We know that the vector u_1 is the most dominant basis vector representing the range of the term space (i.e., the column space of A). Hence it is of interest to check which terms u_1 represents (note that the entries of u_1 are nonnegative for this matrix). The 10 terms corresponding to the largest entries of u_1: ‘network’, ‘model’, ‘learning’, ‘input’, ‘function’, ‘neural’, ‘set’, ‘training’, ‘data’, ‘unit’.
Compare these with the top 10 most frequently used terms: ‘network’, ‘model’, ‘learning’, ‘function’, ‘input’, ‘neural’, ‘set’, ‘algorithm’, ‘system’, ‘data’. As you can see, they are very close.
Let’s check the entries of u_2, which contains both positive and negative values. The top 5 positive entries of u_2: ‘network’, ‘unit’, ‘input’, ‘neural’, ‘output’, while the top 5 negative entries of u_2: ‘model’, ‘data’, ‘algorithm’, ‘learning’, ‘parameter’.
My interpretation: u_2 tries to differentiate articles related to neuroscience from those related to machine learning algorithms.
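A sketch (again assuming the svds call and the vocab list from the earlier snippets) of how such term lists could be extracted. Note that the sign of each singular vector is arbitrary, so a sign flip may be needed to match the slide.

    import numpy as np
    from scipy.sparse.linalg import svds

    def top_terms(A, vocab, k=2, n_top=10):
        """Print the largest and most negative entries of u_1, ..., u_k by term."""
        U, s, _ = svds(A.astype(float), k=k)
        U = U[:, np.argsort(s)[::-1]]            # reorder columns: largest singular value first
        for i in range(k):
            u = U[:, i]
            if u[np.argmax(np.abs(u))] < 0:      # fix the arbitrary sign of the singular vector
                u = -u
            order = np.argsort(u)
            print(f"u_{i+1} largest entries:", [vocab[j] for j in order[::-1][:n_top]])
            print(f"u_{i+1} most negative entries:", [vocab[j] for j in order[:n_top]])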
