Linear Algebraic Models in Information Retrieval
Nathan Pruitt and Rami Awwad
December 12th, 2016
Information Retrieval In a Nutshell
Information Retrieval – defined as finding information relevant to a search in a database containing documents, images, articles, etc.
Practical real-life example – finding an article or book in a library through the catalog system, or through the library's database via a search engine
The most common examples are internet search engines such as Google and Yahoo, but information retrieval is also used on many other sites, wherever there is a search feature
A Brief History of I.R. in the Digital Domain
S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text), developed at Cornell University in the 1960s
Its legacy includes the development of I.R. models such as the vector space model
The Vector Space Model
A text-based ranking model common to internet search engines in the early 1990s
Works by building a t × d matrix, where t is the number of terms (potentially every term in an English dictionary)
d is the number of documents in the search engine's database
The Vector Space Model
\[
\begin{array}{c|ccccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 & \cdots & d_{1{,}000{,}000} \\
\hline
t_1 & m_{1,1} & m_{1,2} & m_{1,3} & m_{1,4} & m_{1,5} & \cdots & m_{1,\,1000000} \\
t_2 & m_{2,1} & m_{2,2} & m_{2,3} & m_{2,4} & m_{2,5} & \cdots & m_{2,\,1000000} \\
t_3 & m_{3,1} & m_{3,2} & m_{3,3} & m_{3,4} & m_{3,5} & \cdots & m_{3,\,1000000} \\
t_4 & m_{4,1} & m_{4,2} & m_{4,3} & m_{4,4} & m_{4,5} & \cdots & m_{4,\,1000000} \\
t_5 & m_{5,1} & m_{5,2} & m_{5,3} & m_{5,4} & m_{5,5} & \cdots & m_{5,\,1000000} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{300{,}000} & m_{300000,1} & m_{300000,2} & m_{300000,3} & m_{300000,4} & m_{300000,5} & \cdots & m_{300000,\,1000000}
\end{array}
\]
Each entry m is given a weight depending on the number of times term t occurs in document d, then processed through an arithmetic weighting scheme
The weights allow comparison between documents, and between documents and a query, via the angles between their column vectors
VSM: A Simpler Example
\[
M_{example} =
\begin{array}{c|ccc}
 & doc_1 & doc_2 & doc_3 \\
\hline
internet & 38 & 14 & 20 \\
graph & 10 & 20 & 5 \\
directed & 0 & 2 & 10
\end{array}
\qquad
Query =
\begin{array}{c|c}
 & term \\
\hline
internet & 1 \\
graph & 1 \\
directed & 1
\end{array}
\]
The entries are called term frequencies
Term frequencies are processed through an arithmetic weighting scheme because a higher tf does not necessarily mean a more relevant website
The engine treats the query as a bag of words – the order of the terms is ignored
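The bag-of-words counting that produces such a matrix can be sketched in a few lines (the three toy documents here are invented for illustration; only the counting logic matters):

```python
from collections import Counter

# Hypothetical toy corpus: three "documents" standing in for doc1-doc3.
docs = [
    "internet graph internet",
    "graph directed graph internet",
    "internet directed",
]
terms = ["internet", "graph", "directed"]

# Bag-of-words term frequencies: entry [i][j] counts occurrences of
# term i in document j, ignoring word order entirely.
counts = [Counter(d.split()) for d in docs]
tf = [[c[t] for c in counts] for t in terms]

print(tf)  # rows: internet, graph, directed; columns: doc1..doc3
```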
Length-Normalized t × d Matrix and Query Vector
\[
Query^* =
\begin{array}{c|c}
 & term \\
\hline
internet & 1/\sqrt{3} \\
graph & 1/\sqrt{3} \\
directed & 1/\sqrt{3}
\end{array}
\qquad
M^*_{example} =
\begin{array}{c|ccc}
 & doc_1 & doc_2 & doc_3 \\
\hline
internet & 0.790 & 0.630 & 0.659 \\
graph & 0.612 & 0.676 & 0.487 \\
directed & 0 & 0.382 & 0.573
\end{array}
\]
After the arithmetic weighting scheme is applied, the matrix and query vector are length normalized
This simplifies calculation of the angles between document vectors, and between the document vectors and the query
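The normalization step can be sketched as follows (the weighted matrix M below is invented for illustration; it stands in for the output of an arithmetic weighting scheme):

```python
import math

# Hypothetical weighted term-by-document matrix (columns are documents).
M = [
    [3.0, 2.0, 4.0],
    [2.0, 2.5, 3.0],
    [0.0, 1.5, 3.5],
]

def normalize_columns(m):
    """Scale each column to unit Euclidean length."""
    rows, cols = len(m), len(m[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        norm = math.sqrt(sum(m[i][j] ** 2 for i in range(rows)))
        for i in range(rows):
            out[i][j] = m[i][j] / norm
    return out

Mstar = normalize_columns(M)
# Every column of Mstar now has length 1, so column dot products are cosines.
```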
VSM: The "Cosine Similarity"
\[
\cos(doc_1, doc_2) \approx
\frac{\begin{pmatrix} 0.790 \\ 0.612 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0.630 \\ 0.676 \\ 0.382 \end{pmatrix}}{\|doc_1\|\,\|doc_2\|}
\approx \frac{0.912}{1} \approx 0.912
\]
\[
\cos(doc_1, doc_3) \approx 0.819 \qquad \cos(doc_2, doc_3) \approx 0.963
\]
\[
\cos(Query, doc_1) \approx 0.810 \qquad \cos(Query, doc_2) \approx 0.975 \qquad \cos(Query, doc_3) \approx 0.993
\]
These calculations imply the following angles separate each vector:
\[
\angle(doc_1, doc_2) \approx \arccos(0.912)\left(\tfrac{180^\circ}{\pi}\right) \approx 24.188^\circ
\]
\[
\angle(doc_1, doc_3) \approx 34.985^\circ \qquad \angle(doc_2, doc_3) \approx 15.530^\circ
\]
\[
\angle(Query, doc_1) \approx 35.901^\circ \qquad \angle(Query, doc_2) \approx 12.918^\circ \qquad \angle(Query, doc_3) \approx 7.006^\circ
\]
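These cosine and angle computations can be reproduced with a short script. It uses the three-decimal normalized vectors from the slide, so the results differ from the slide's figures only in the last digit or two:

```python
import math

# Length-normalized document vectors and query from the example.
doc1 = [0.790, 0.612, 0.0]
doc2 = [0.630, 0.676, 0.382]
doc3 = [0.659, 0.487, 0.573]
query = [1 / math.sqrt(3)] * 3

def cosine(u, v):
    # For unit-length vectors, the cosine of the angle is just the dot product.
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # arccos, converted from radians to degrees (multiplied by 180/pi).
    return math.degrees(math.acos(cosine(u, v)))

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3), round(angle_deg(query, d), 3))
```

Running this ranks doc3 closest to the query (smallest angle), then doc2, then doc1, the same ordering as on the slide.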
VSM: Visualization of Document Vectors and their Shared Angles
[Figure: cosine similarity between doc1 and doc2, and between doc2 and doc3]
[Figure: cosine similarity between doc1 and doc3]
VSM: Visualization of Document Vectors and their Shared Angles with Query Vector
[Figure: cosine similarity between doc2 and the query, and between doc3 and the query]
[Figure: cosine similarity between doc1 and the query]
PageRank Algorithm
Google's matrix has over 8 billion rows and columns.
[Figure: directed graph of websites numbered 1–7]
This directed graph represents the link structure among the websites, from which their overall rankings are derived.
This is a Markov Chain.
The arrows represent links between different websites.
For example, website 1 only links to website 2.
PageRank Algorithm
\[
P =
\begin{array}{c|ccccccc}
 & j_1 & j_2 & j_3 & j_4 & j_5 & j_6 & j_7 \\
\hline
i_1 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\
i_2 & 1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & 0 \\
i_3 & 0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
i_4 & 0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & 0 \\
i_5 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & 0 \\
i_6 & 0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & 0 \\
i_7 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & 1
\end{array}
\]
This matrix P shows the probabilities of movement between these websites. Because website 1 only links to website 2, there is a 100 percent chance of that move.
Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
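Building P from a link structure can be sketched as follows. The outlink lists are read off the directed graph; treating website 7 as linking only to itself is an assumption of this reconstruction:

```python
from fractions import Fraction

# Outlink lists: website -> websites it links to.
links = {
    1: [2],
    2: [3, 4, 6],
    3: [2, 4],
    4: [1, 5],
    5: [2, 6],
    6: [2, 4, 5, 7],
    7: [7],  # assumption: site 7 links only to itself
}
n = 7

# P[i][j] = probability of moving from website j+1 to website i+1:
# a surfer on site j picks one of its outlinks uniformly at random.
P = [[Fraction(0)] * n for _ in range(n)]
for j, outs in links.items():
    for i in outs:
        P[i - 1][j - 1] = Fraction(1, len(outs))

# Each column sums to 1, so P is column-stochastic.
assert all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
```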
PageRank Algorithm
Notice that the entries of each column vector in transition matrix P sum to 1. Therefore, all column vectors in P are probability vectors.
Thus our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties.
One of these properties states that every stochastic matrix has an eigenvalue equal to 1. The eigenvector corresponding to 1 will tell us the rank of our 7 websites, or in Google terms, the PageRank of each website.
PageRank Algorithm
To approach this eigenvector, we calculate the steady-state vector x_n of our seven-website chain:
\[
x_n = \begin{pmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_7 \end{pmatrix}
\]
All stochastic matrices have a steady-state vector. Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.
PageRank Algorithm
We use this equation to compute steady-state vectors:
limn→∞
xn = Pnk x0
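The limit above is computed in practice by power iteration: repeatedly apply the matrix to a starting probability vector until it stops changing. A minimal sketch, using a made-up 2 × 2 column-stochastic matrix in place of the 7 × 7 one:

```python
# Multiply a matrix (list of rows) by a vector.
def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

# Apply the matrix repeatedly; for a stochastic matrix the result
# converges to the steady-state vector.
def steady_state(m, x0, steps=100):
    x = x0
    for _ in range(steps):
        x = matvec(m, x)
    return x

M = [[0.9, 0.2],
     [0.1, 0.8]]  # each column sums to 1
x = steady_state(M, [1.0, 0.0])
# The limit satisfies M x = x: x is the eigenvector for eigenvalue 1.
# Here x approaches (2/3, 1/3).
```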
Adjustment to Transition Matrix
Google is said to use a damping factor p with a value of 0.85. Then, we retrieve our P_{nk} as follows:
\[
P_{nk} = 0.85
\begin{pmatrix}
0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{7} \\
1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{7} \\
0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7}
\end{pmatrix}
+ 0.15
\begin{pmatrix}
\tfrac{1}{7} & \cdots & \tfrac{1}{7} \\
\vdots & \ddots & \vdots \\
\tfrac{1}{7} & \cdots & \tfrac{1}{7}
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.02142 & 0.14285 \\
0.87142 & 0.02142 & 0.44642 & 0.02142 & 0.44642 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.14285 \\
0.02142 & 0.30476 & 0.44642 & 0.02142 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.23392 & 0.14285
\end{pmatrix}
\]
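The damped matrix can be assembled programmatically (a sketch: S is the link matrix with its seventh column replaced by uniform 1/7 entries, following the slide, and J is the all-ones matrix in the formula P_nk = p·S + (1 − p)·(1/n)·J):

```python
p, n = 0.85, 7

# Link matrix with the seventh column replaced by uniform 1/7 entries.
S = [
    [0,   0,   0,   1/2, 0,   0,   1/7],
    [1,   0,   1/2, 0,   1/2, 1/4, 1/7],
    [0,   1/3, 0,   0,   0,   0,   1/7],
    [0,   1/3, 1/2, 0,   0,   1/4, 1/7],
    [0,   0,   0,   1/2, 0,   1/4, 1/7],
    [0,   1/3, 0,   0,   1/2, 0,   1/7],
    [0,   0,   0,   0,   0,   1/4, 1/7],
]

# Damping mixes S with the uniform matrix: every entry gets (1 - p)/n added.
G = [[p * S[i][j] + (1 - p) / n for j in range(n)] for i in range(n)]
# For example, G[0][3] = 0.85 * 1/2 + 0.15/7, about 0.44642 as on the slide.
```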
Final Rank
Starting from x_0 = (0, 1, 0, 0, 0, 0, 0)^T and iterating to n = 75:
\[
x_{75} = P_{nk}^{75}
\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\approx
\begin{pmatrix} 0.104631 \\ 0.253767 \\ 0.100953 \\ 0.177828 \\ 0.138598 \\ 0.159857 \\ 0.063021 \end{pmatrix}
\]
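As a check, the 75 iterations can be carried out in a short script (a sketch; S is copied from the adjusted link matrix on the previous slide):

```python
p, n = 0.85, 7
S = [
    [0,   0,   0,   1/2, 0,   0,   1/7],
    [1,   0,   1/2, 0,   1/2, 1/4, 1/7],
    [0,   1/3, 0,   0,   0,   0,   1/7],
    [0,   1/3, 1/2, 0,   0,   1/4, 1/7],
    [0,   0,   0,   1/2, 0,   1/4, 1/7],
    [0,   1/3, 0,   0,   1/2, 0,   1/7],
    [0,   0,   0,   0,   0,   1/4, 1/7],
]
G = [[p * S[i][j] + (1 - p) / n for j in range(n)] for i in range(n)]

# Apply the damped matrix 75 times to x0 = e2, the slide's starting vector.
x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(75):
    x = [sum(G[i][j] * x[j] for j in range(n)) for i in range(n)]

# x is approximately (0.1046, 0.2538, 0.1010, 0.1778, 0.1386, 0.1599, 0.0630):
# website 2 ranks highest and website 7 lowest.
```

Since the damped matrix is positive and column-stochastic, the iteration converges to the same ranking vector from any starting probability vector.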
Bibliography
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval
2. Michael W. Berry, Zlatko Drmač, Elizabeth R. Jessup. Matrices, Vector Spaces, and Information Retrieval
3. Raluca Tanase, Remus Radu. The Mathematics of Web Search
4. M.W. Berry, S.T. Dumais, G.W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval
5. Howard Anton, Robert C. Busby. Contemporary Linear Algebra