
Linear Algebraic Models in Information Retrieval

Nathan Pruitt and Rami Awwad

December 12th, 2016

Nathan Pruitt and Rami Awwad Linear Algebraic Models in Information Retrieval December 12th, 2016 1 / 18


Information Retrieval In a Nutshell

Information retrieval: finding information relevant to a search in a database containing documents, images, articles, etc.

Practical real-life example: finding an article or book in a library through its catalog system, or through the library's database via a search engine

The most common type is the internet search engine, à la Google or Yahoo, but retrieval is also used on many other sites, wherever there is a search feature


A Brief History of I.R. in the Digital Domain

S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text) was developed at Cornell University in the 1960s

Its legacy includes the development of I.R. models, among them the vector space model


The Vector Space Model

A text-based ranking model common to internet search engines in the early 1990s

Works by building a t × d matrix, where t can cover all the terms in an English dictionary

and d is the number of documents in the search engine's database


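\[
\begin{array}{c|ccccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 & \cdots & d_{1{,}000{,}000} \\ \hline
t_1 & m_{1,1} & m_{1,2} & m_{1,3} & m_{1,4} & m_{1,5} & \cdots & m_{1,\,1{,}000{,}000} \\
t_2 & m_{2,1} & m_{2,2} & m_{2,3} & m_{2,4} & m_{2,5} & \cdots & m_{2,\,1{,}000{,}000} \\
t_3 & m_{3,1} & m_{3,2} & m_{3,3} & m_{3,4} & m_{3,5} & \cdots & m_{3,\,1{,}000{,}000} \\
t_4 & m_{4,1} & m_{4,2} & m_{4,3} & m_{4,4} & m_{4,5} & \cdots & m_{4,\,1{,}000{,}000} \\
t_5 & m_{5,1} & m_{5,2} & m_{5,3} & m_{5,4} & m_{5,5} & \cdots & m_{5,\,1{,}000{,}000} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{300{,}000} & m_{300{,}000,\,1} & m_{300{,}000,\,2} & m_{300{,}000,\,3} & m_{300{,}000,\,4} & m_{300{,}000,\,5} & \cdots & m_{300{,}000,\,1{,}000{,}000}
\end{array}
\]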

Each entry m is given a weight based on the number of times term t occurs in document d, adjusted by an arithmetic weighting scheme

The weights allow document-to-document and document-to-query comparison through the angles between their column vectors
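A minimal sketch of how a small term-by-document matrix of raw counts could be assembled. The toy corpus, the whitespace tokenization, and the name `build_term_document_matrix` are illustrative assumptions, not something the slides specify.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Bag-of-words tokenization: lowercase and split on whitespace (an assumption).
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Rows are the sorted vocabulary of the whole collection.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Hypothetical toy corpus, for illustration only.
docs = [
    "internet graph internet",
    "directed graph",
    "internet directed graph graph",
]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each column of `matrix` is one document vector and each row is one term, matching the t × d layout described above.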


VSM: A Simpler Example

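\[
M_{\text{example}} =
\begin{array}{l|ccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 \\ \hline
\text{internet} & 38 & 14 & 20 \\
\text{graph} & 10 & 20 & 5 \\
\text{directed} & 0 & 2 & 10
\end{array}
\qquad
\text{Query} =
\begin{array}{l|c}
 & \text{term} \\ \hline
\text{internet} & 1 \\
\text{graph} & 1 \\
\text{directed} & 1
\end{array}
\]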

The entries are called term frequencies (tf)

Term frequencies are processed through an arithmetic weighting scheme because a higher tf doesn't necessarily mean a more relevant website

The engine treats the query as a bag of words: the order of the terms is ignored


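Length-Normalized t × d Matrix and Query Vector

\[
\text{Query}^{*} =
\begin{array}{l|c}
 & \text{term} \\ \hline
\text{internet} & \tfrac{1}{\sqrt{3}} \\
\text{graph} & \tfrac{1}{\sqrt{3}} \\
\text{directed} & \tfrac{1}{\sqrt{3}}
\end{array}
\qquad
M_{\text{example}}^{*} =
\begin{array}{l|ccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 \\ \hline
\text{internet} & 0.790 & 0.630 & 0.659 \\
\text{graph} & 0.612 & 0.676 & 0.487 \\
\text{directed} & 0 & 0.382 & 0.573
\end{array}
\]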

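After the arithmetic weighting scheme is applied, the matrix and query vector are length-normalized, so every column vector has length 1

This simplifies calculating the angles between document vectors, and between the document vectors and the query, because the cosine of each angle reduces to a dot product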
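The slides do not name the weighting scheme. The normalized values above are, however, consistent with a log-frequency weight of 1 + log10(tf) (and 0 when tf = 0) followed by dividing each column by its Euclidean length. The sketch below makes that assumption explicit; `log_weight` and `normalize_columns` are illustrative names. Under this assumption the computed entries agree with the slide to within about 0.001.

```python
import math

# Raw term frequencies from the simpler example: rows are terms, columns are docs.
TERMS = ["internet", "graph", "directed"]
TF = [
    [38, 14, 20],
    [10, 20, 5],
    [0, 2, 10],
]


def log_weight(tf):
    """Assumed weighting (not stated in the slides): 1 + log10(tf), or 0 if tf is 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalize_columns(matrix):
    """Scale every column to unit Euclidean length."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in matrix]


weighted = [[log_weight(tf) for tf in row] for row in TF]
m_star = normalize_columns(weighted)
query_star = [1 / math.sqrt(3)] * 3  # query (1, 1, 1) scaled to unit length

for term, row in zip(TERMS, m_star):
    print(term, [round(v, 3) for v in row])
```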


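VSM: The "Cosine Similarity"

\[
\cos(\text{doc}_1, \text{doc}_2) \approx
\frac{\begin{bmatrix} 0.790 \\ 0.612 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} 0.630 \\ 0.676 \\ 0.382 \end{bmatrix}}{\lVert \text{doc}_1 \rVert \, \lVert \text{doc}_2 \rVert}
\approx \frac{0.912}{1} \approx 0.912
\]

cos(doc1, doc3) ≈ 0.819

cos(doc2, doc3) ≈ 0.963

cos(Query, doc1) ≈ 0.810

cos(Query, doc2) ≈ 0.975

cos(Query, doc3) ≈ 0.993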

These calculations imply that the following angles separate each pair of vectors:

\[
\angle(\text{doc}_1, \text{doc}_2) \approx \arccos(0.912) \left( \frac{180^\circ}{\pi} \right) \approx 24.188^\circ
\]

(doc1, doc3) ≈ 34.985°

(doc2, doc3) ≈ 15.530°

(Query, doc1) ≈ 35.901°

(Query, doc2) ≈ 12.918°

(Query, doc3) ≈ 7.006°
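The cosines and angles above can be recomputed from the three-decimal normalized vectors with a short sketch like the one below. Because its inputs are already rounded, the results match the slide values only to about three decimal places for the cosines and a few hundredths of a degree for the angles. The names `cosine`, `angle_degrees`, and `ranking` are illustrative.

```python
import math

# Length-normalized vectors from the slides (terms: internet, graph, directed).
VECTORS = {
    "doc1": [0.790, 0.612, 0.0],
    "doc2": [0.630, 0.676, 0.382],
    "doc3": [0.659, 0.487, 0.573],
    "query": [1 / math.sqrt(3)] * 3,
}


def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between u and v in degrees; the clamp guards against rounding past 1."""
    return math.degrees(math.acos(min(1.0, cosine(u, v))))


pairs = [("doc1", "doc2"), ("doc1", "doc3"), ("doc2", "doc3"),
         ("query", "doc1"), ("query", "doc2"), ("query", "doc3")]
for a, b in pairs:
    print(a, b, round(cosine(VECTORS[a], VECTORS[b]), 3),
          round(angle_degrees(VECTORS[a], VECTORS[b]), 1))

# Rank documents by similarity to the query, most similar first.
ranking = sorted(["doc1", "doc2", "doc3"],
                 key=lambda d: cosine(VECTORS["query"], VECTORS[d]),
                 reverse=True)
print(ranking)
```

A smaller angle means a closer match, so in this example doc3 is ranked first for the query and doc1 last.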


VSM: Visualization of Document Vectors and Their Shared Angles

Figure: Cosine similarity between doc1 and doc2, and between doc2 and doc3

Figure: Cosine similarity between doc1 and doc3


VSM: Visualization of Document Vectors and Their Shared Angles with the Query Vector

Figure: Cosine similarity between doc2 and the query, and between doc3 and the query

Figure: Cosine similarity between doc1 and the query


PageRank Algorithm

Google's matrix has over 8 billion rows and columns.

[Figure: directed graph on seven nodes, numbered 1 through 7]

This directed graph represents the overall rankings of the websites.

This is a Markov Chain.

The arrows represent links between different websites.

For example, website 1 only links to website 2.


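\[
P =
\begin{array}{c|ccccccc}
 & j_1 & j_2 & j_3 & j_4 & j_5 & j_6 & j_7 \\ \hline
i_1 & 0 & 0 & 0 & \tfrac12 & 0 & 0 & 0 \\
i_2 & 1 & 0 & \tfrac12 & 0 & \tfrac12 & \tfrac14 & 0 \\
i_3 & 0 & \tfrac13 & 0 & 0 & 0 & 0 & 0 \\
i_4 & 0 & \tfrac13 & \tfrac12 & 0 & 0 & \tfrac14 & 0 \\
i_5 & 0 & 0 & 0 & \tfrac12 & 0 & \tfrac14 & 0 \\
i_6 & 0 & \tfrac13 & 0 & 0 & \tfrac12 & 0 & 0 \\
i_7 & 0 & 0 & 0 & 0 & 0 & \tfrac14 & 1
\end{array}
\]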

This matrix P shows the probabilities of movement between these websites. Because website 1 links only to website 2, there is a 100 percent chance of that move.

Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
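As a sketch, P can be rebuilt from each site's list of out-links, under the assumption that a surfer picks uniformly among the links on the current page. The link lists below are read off the nonzero entries in the columns of P (site 7 links only to itself in this reading). `OUT_LINKS` and `transition_matrix` are illustrative names.

```python
from fractions import Fraction

# Out-links read off the columns of P: site j links to every i with P[i][j] > 0.
OUT_LINKS = {
    1: [2],
    2: [3, 4, 6],
    3: [2, 4],
    4: [1, 5],
    5: [2, 6],
    6: [2, 4, 5, 7],
    7: [7],
}


def transition_matrix(out_links):
    """Column-stochastic P: entry [i][j] is the chance of moving from j to i."""
    n = len(out_links)
    P = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            # Assumption: each out-link of site j is followed with equal probability.
            P[i - 1][j - 1] = Fraction(1, len(targets))
    return P


P = transition_matrix(OUT_LINKS)
column_sums = [sum(P[i][j] for i in range(7)) for j in range(7)]
for row in P:
    print("  ".join(f"{str(x):>3}" for x in row))
```

Exact fractions are used so that the column sums can be compared with 1 without floating-point error.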


Notice that the entries of each column vector in transition matrix P sum to 1. Therefore, all column vectors in P are probability vectors.

Thus our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties.

One of these properties states that every stochastic matrix has at least one eigenvalue equal to 1. The eigenvector corresponding to 1 will tell us the rank of our 7 websites, or in Google terms, the PageRank of each website.
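One standard way to see why the eigenvalue 1 always appears, sketched here because the slides state the property without proof: write the "columns sum to 1" condition with the all-ones vector, written here as a bold 1 (notation introduced for this sketch).

```latex
\mathbf{1}^{\mathsf{T}} P = \mathbf{1}^{\mathsf{T}}
\;\Longleftrightarrow\;
P^{\mathsf{T}} \mathbf{1} = 1 \cdot \mathbf{1},
\qquad
\det(P - \lambda I) = \det\!\big((P - \lambda I)^{\mathsf{T}}\big) = \det(P^{\mathsf{T}} - \lambda I).
```

The first statement says that 1 is an eigenvalue of the transpose of P, with the all-ones vector as its eigenvector. The second says that P and its transpose share a characteristic polynomial, so 1 is also an eigenvalue of P, and some nonzero x satisfies Px = x.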


To approach this eigenvector, we calculate the steady-state vector x_n of our 7-website chain:

\[
x_n = \begin{bmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_7 \end{bmatrix}
\]

All stochastic matrices have a steady-state vector. Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.


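We use this equation to compute steady-state vectors, where x_0 is the starting vector and the matrix is applied once per click:

\[
\lim_{n \to \infty} x_n = \lim_{n \to \infty} (P_{nk})^{n} \, x_0
\]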


Adjustment to Transition Matrix

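Google is said to use p = 0.85. With that value, we obtain our P_{nk} as follows. Note that the seventh column of the first matrix is uniformly 1/7, whereas in P it held a single 1 in row 7.

\[
P_{nk} = 0.85
\begin{bmatrix}
0 & 0 & 0 & \tfrac12 & 0 & 0 & \tfrac17 \\
1 & 0 & \tfrac12 & 0 & \tfrac12 & \tfrac14 & \tfrac17 \\
0 & \tfrac13 & 0 & 0 & 0 & 0 & \tfrac17 \\
0 & \tfrac13 & \tfrac12 & 0 & 0 & \tfrac14 & \tfrac17 \\
0 & 0 & 0 & \tfrac12 & 0 & \tfrac14 & \tfrac17 \\
0 & \tfrac13 & 0 & 0 & \tfrac12 & 0 & \tfrac17 \\
0 & 0 & 0 & 0 & 0 & \tfrac14 & \tfrac17
\end{bmatrix}
+ 0.15
\begin{bmatrix}
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 \\
\tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17 & \tfrac17
\end{bmatrix}
\]

\[
=
\begin{bmatrix}
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.02142 & 0.14285 \\
0.87142 & 0.02142 & 0.44642 & 0.02142 & 0.44642 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.14285 \\
0.02142 & 0.30476 & 0.44642 & 0.02142 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.23392 & 0.14285
\end{bmatrix}
\]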
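The adjustment can be sketched in exact arithmetic as 0.85 times the link matrix plus 0.15 times a matrix whose entries are all 1/7. The names `ENTRIES`, `adjusted_matrix`, and `G` are illustrative; the nonzero entries are copied from the first matrix on this slide, including its uniform seventh column.

```python
from fractions import Fraction

N = 7
DAMPING = Fraction(85, 100)  # the value 0.85 quoted on the slide

# Nonzero entries of the first matrix on the slide, keyed by (row, column).
ENTRIES = {
    (1, 4): Fraction(1, 2),
    (2, 1): Fraction(1), (2, 3): Fraction(1, 2),
    (2, 5): Fraction(1, 2), (2, 6): Fraction(1, 4),
    (3, 2): Fraction(1, 3),
    (4, 2): Fraction(1, 3), (4, 3): Fraction(1, 2), (4, 6): Fraction(1, 4),
    (5, 4): Fraction(1, 2), (5, 6): Fraction(1, 4),
    (6, 2): Fraction(1, 3), (6, 5): Fraction(1, 2),
    (7, 6): Fraction(1, 4),
}
# Column 7 is uniform (1/7 in every row), unlike the original P.
for i in range(1, N + 1):
    ENTRIES[(i, 7)] = Fraction(1, N)


def adjusted_matrix(entries, damping, n):
    """Return damping * A + (1 - damping) * (1/n) in every entry, as exact fractions."""
    uniform = (1 - damping) * Fraction(1, n)
    return [[damping * entries.get((i, j), Fraction(0)) + uniform
             for j in range(1, n + 1)]
            for i in range(1, n + 1)]


G = adjusted_matrix(ENTRIES, DAMPING, N)
column_sums = [sum(G[i][j] for i in range(N)) for j in range(N)]
for row in G:
    print(" ".join(f"{float(x):.5f}" for x in row))
```

The printed decimals are rounded to five places, while the slide appears to truncate them, so the last digit can differ by one (for example 0.02143 versus 0.02142). Every column of `G` still sums to exactly 1, so the adjusted matrix remains stochastic.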


Final Rank

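\[
\lim_{n \to 75} x_n =
\begin{bmatrix}
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.02142 & 0.14285 \\
0.87142 & 0.02142 & 0.44642 & 0.02142 & 0.44642 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.14285 \\
0.02142 & 0.30476 & 0.44642 & 0.02142 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.23392 & 0.14285
\end{bmatrix}^{n}
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]

\[
x_n =
\begin{bmatrix} 0.104631 \\ 0.253767 \\ 0.100953 \\ 0.177828 \\ 0.138598 \\ 0.159857 \\ 0.063021 \end{bmatrix}
\]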
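The final vector can be approximated by repeated multiplication (power iteration), sketched below with the slide's 75 steps and its starting vector, which places the surfer on website 2. The matrix is rebuilt from exact fractions rather than the truncated decimals, so the result sums to 1 and agrees with the slide's values to about three decimal places. `A`, `G`, `step`, and `ranking` are illustrative names.

```python
# Link matrix from the adjustment step, before damping: rows are destination
# sites, columns are source sites, and column 7 is uniform.
S = 1 / 7
A = [
    [0, 0,   0,   1/2, 0,   0,   S],
    [1, 0,   1/2, 0,   1/2, 1/4, S],
    [0, 1/3, 0,   0,   0,   0,   S],
    [0, 1/3, 1/2, 0,   0,   1/4, S],
    [0, 0,   0,   1/2, 0,   1/4, S],
    [0, 1/3, 0,   0,   1/2, 0,   S],
    [0, 0,   0,   0,   0,   1/4, S],
]
DAMPING = 0.85
N = len(A)

# Adjusted matrix: 0.85 * A plus 0.15 * (1/7) in every entry.
G = [[DAMPING * A[i][j] + (1 - DAMPING) / N for j in range(N)] for i in range(N)]


def step(matrix, x):
    """One click: multiply the probability vector x by the matrix."""
    return [sum(matrix[i][j] * x[j] for j in range(len(x)))
            for i in range(len(matrix))]


x = [0, 1, 0, 0, 0, 0, 0]  # x0: the surfer starts on website 2
for _ in range(75):
    x = step(G, x)

# Sites ordered from highest to lowest rank.
ranking = sorted(range(1, N + 1), key=lambda site: x[site - 1], reverse=True)
print([round(v, 4) for v in x])
print(ranking)
```

Sorting the sites by their entry in `x` gives the final ordering, with website 2 ranked highest and website 7 lowest.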

