Linear Algebraic Models in Information Retrieval
Nathan Pruitt and Rami Awwad
December 12th, 2016
Information Retrieval In a Nutshell
Information Retrieval – defined as finding information relevant to a search in a database containing documents, images, articles, etc.
Practical real-life example – finding an article or book in a library through the catalog system, or through the library's database via a search engine
The most common examples are internet search engines such as Google and Yahoo, but information retrieval is also used on many other sites, wherever there is a search feature
A Brief History of I.R. in the Digital Domain
S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text), developed at Cornell University in the 1960s
Its legacy includes the development of I.R. models such as the vector space model
The Vector Space Model
A text-based ranking model common to internet search engines in the early 1990s
Works by building a t × d matrix, where t is the number of terms (potentially every term in an English dictionary)
d is the number of documents in the search engine's database
The Vector Space Model
\[
\begin{array}{c|ccccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 & \cdots & d_{1{,}000{,}000} \\
\hline
t_1 & m_{1,1} & m_{1,2} & m_{1,3} & m_{1,4} & m_{1,5} & \cdots & m_{1,\,1000000} \\
t_2 & m_{2,1} & m_{2,2} & m_{2,3} & m_{2,4} & m_{2,5} & \cdots & m_{2,\,1000000} \\
t_3 & m_{3,1} & m_{3,2} & m_{3,3} & m_{3,4} & m_{3,5} & \cdots & m_{3,\,1000000} \\
t_4 & m_{4,1} & m_{4,2} & m_{4,3} & m_{4,4} & m_{4,5} & \cdots & m_{4,\,1000000} \\
t_5 & m_{5,1} & m_{5,2} & m_{5,3} & m_{5,4} & m_{5,5} & \cdots & m_{5,\,1000000} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{300{,}000} & m_{300000,1} & m_{300000,2} & m_{300000,3} & m_{300000,4} & m_{300000,5} & \cdots & m_{300000,\,1000000}
\end{array}
\]
Each entry m is given a weight depending on the number of times term t occurs in document d, then processed through an arithmetic weighting scheme
The weights allow comparison between documents, and between documents and a query, via the angles between their column vectors
VSM: A Simpler Example
\[
M_{example} =
\begin{array}{c|ccc}
 & doc_1 & doc_2 & doc_3 \\
\hline
internet & 38 & 14 & 20 \\
graph & 10 & 20 & 5 \\
directed & 0 & 2 & 10
\end{array}
\qquad
Query =
\begin{array}{c|c}
 & term \\
\hline
internet & 1 \\
graph & 1 \\
directed & 1
\end{array}
\]
The entries are called term frequencies
Term frequencies are processed through an arithmetic weighting scheme because a higher tf does not necessarily mean a more relevant website
The engine treats the query as a bag of words – the order of the terms is ignored
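The bag-of-words counting that produces such a matrix can be sketched in a few lines (the three toy documents here are invented for illustration; only the counting logic matters):

```python
from collections import Counter

# Hypothetical toy corpus: three "documents" standing in for doc1-doc3.
docs = [
    "internet graph internet",
    "graph directed graph internet",
    "internet directed",
]
terms = ["internet", "graph", "directed"]

# Bag-of-words term frequencies: entry [i][j] counts occurrences of
# term i in document j, ignoring word order entirely.
counts = [Counter(d.split()) for d in docs]
tf = [[c[t] for c in counts] for t in terms]

print(tf)  # rows: internet, graph, directed; columns: doc1..doc3
```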
Length-Normalized t × d Matrix and Query Vector
\[
Query^* =
\begin{array}{c|c}
 & term \\
\hline
internet & 1/\sqrt{3} \\
graph & 1/\sqrt{3} \\
directed & 1/\sqrt{3}
\end{array}
\qquad
M^*_{example} =
\begin{array}{c|ccc}
 & doc_1 & doc_2 & doc_3 \\
\hline
internet & 0.790 & 0.630 & 0.659 \\
graph & 0.612 & 0.676 & 0.487 \\
directed & 0 & 0.382 & 0.573
\end{array}
\]
After the arithmetic weighting scheme is applied, the matrix and query vector are length normalized
This simplifies calculation of the angles between document vectors, and between the document vectors and the query
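The normalization step can be sketched as follows (the weighted matrix M below is invented for illustration; it stands in for the output of an arithmetic weighting scheme):

```python
import math

# Hypothetical weighted term-by-document matrix (columns are documents).
M = [
    [3.0, 2.0, 4.0],
    [2.0, 2.5, 3.0],
    [0.0, 1.5, 3.5],
]

def normalize_columns(m):
    """Scale each column to unit Euclidean length."""
    rows, cols = len(m), len(m[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        norm = math.sqrt(sum(m[i][j] ** 2 for i in range(rows)))
        for i in range(rows):
            out[i][j] = m[i][j] / norm
    return out

Mstar = normalize_columns(M)
# Every column of Mstar now has length 1, so column dot products are cosines.
```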
VSM: The "Cosine Similarity"
\[
\cos(doc_1, doc_2) \approx
\frac{\begin{pmatrix} 0.790 \\ 0.612 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0.630 \\ 0.676 \\ 0.382 \end{pmatrix}}{\|doc_1\|\,\|doc_2\|}
\approx \frac{0.912}{1} \approx 0.912
\]
\[
\cos(doc_1, doc_3) \approx 0.819 \qquad \cos(doc_2, doc_3) \approx 0.963
\]
\[
\cos(Query, doc_1) \approx 0.810 \qquad \cos(Query, doc_2) \approx 0.975 \qquad \cos(Query, doc_3) \approx 0.993
\]
These calculations imply the following angles separate each vector:
\[
\angle(doc_1, doc_2) \approx \arccos(0.912)\left(\tfrac{180^\circ}{\pi}\right) \approx 24.188^\circ
\]
\[
\angle(doc_1, doc_3) \approx 34.985^\circ \qquad \angle(doc_2, doc_3) \approx 15.530^\circ
\]
\[
\angle(Query, doc_1) \approx 35.901^\circ \qquad \angle(Query, doc_2) \approx 12.918^\circ \qquad \angle(Query, doc_3) \approx 7.006^\circ
\]
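These cosine and angle computations can be reproduced with a short script. It uses the three-decimal normalized vectors from the slide, so the results differ from the slide's figures only in the last digit or two:

```python
import math

# Length-normalized document vectors and query from the example.
doc1 = [0.790, 0.612, 0.0]
doc2 = [0.630, 0.676, 0.382]
doc3 = [0.659, 0.487, 0.573]
query = [1 / math.sqrt(3)] * 3

def cosine(u, v):
    # For unit-length vectors, the cosine of the angle is just the dot product.
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # arccos, converted from radians to degrees (multiplied by 180/pi).
    return math.degrees(math.acos(cosine(u, v)))

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3), round(angle_deg(query, d), 3))
```

Running this ranks doc3 closest to the query (smallest angle), then doc2, then doc1, the same ordering as on the slide.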
VSM: Visualization of Document Vectors and their Shared Angles
[Figure: cosine similarity between doc1 and doc2, and between doc2 and doc3]
[Figure: cosine similarity between doc1 and doc3]
VSM: Visualization of Document Vectors and their Shared Angles with Query Vector
[Figure: cosine similarity between doc2 and the query, and between doc3 and the query]
[Figure: cosine similarity between doc1 and the query]
PageRank Algorithm
Google's matrix has over 8 billion rows and columns.
[Figure: directed graph of websites numbered 1–7]
This directed graph represents the link structure among the websites, from which their overall rankings are derived.
This is a Markov Chain.
The arrows represent links between different websites.
For example, website 1 only links to website 2.
PageRank Algorithm
\[
P =
\begin{array}{c|ccccccc}
 & j_1 & j_2 & j_3 & j_4 & j_5 & j_6 & j_7 \\
\hline
i_1 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & 0 \\
i_2 & 1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & 0 \\
i_3 & 0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
i_4 & 0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & 0 \\
i_5 & 0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & 0 \\
i_6 & 0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & 0 \\
i_7 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & 1
\end{array}
\]
This matrix P shows the probabilities of movement between these websites. Because website 1 only links to website 2, there is a 100 percent chance of that move.
Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
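Building P from a link structure can be sketched as follows. The outlink lists are read off the directed graph; treating website 7 as linking only to itself is an assumption of this reconstruction:

```python
from fractions import Fraction

# Outlink lists: website -> websites it links to.
links = {
    1: [2],
    2: [3, 4, 6],
    3: [2, 4],
    4: [1, 5],
    5: [2, 6],
    6: [2, 4, 5, 7],
    7: [7],  # assumption: site 7 links only to itself
}
n = 7

# P[i][j] = probability of moving from website j+1 to website i+1:
# a surfer on site j picks one of its outlinks uniformly at random.
P = [[Fraction(0)] * n for _ in range(n)]
for j, outs in links.items():
    for i in outs:
        P[i - 1][j - 1] = Fraction(1, len(outs))

# Each column sums to 1, so P is column-stochastic.
assert all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
```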
PageRank Algorithm
Notice that the entries of each column vector in transition matrix P sum to 1. Therefore, all column vectors in P are probability vectors.
Thus our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties.
One of these properties states that every stochastic matrix has an eigenvalue equal to 1. The eigenvector corresponding to 1 will tell us the rank of our 7 websites, or in Google terms, the PageRank of each website.
PageRank Algorithm
To approach this eigenvector, we calculate the steady-state vector x_n of our seven-website chain:
\[
x_n = \begin{pmatrix} a_1 \\ \vdots \\ a_j \\ \vdots \\ a_7 \end{pmatrix}
\]
All stochastic matrices have a steady-state vector. Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.
PageRank Algorithm
We use this equation to compute steady-state vectors:
limn→∞
xn = Pnk x0
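The limit above is computed in practice by power iteration: repeatedly apply the matrix to a starting probability vector until it stops changing. A minimal sketch, using a made-up 2 × 2 column-stochastic matrix in place of the 7 × 7 one:

```python
# Multiply a matrix (list of rows) by a vector.
def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

# Apply the matrix repeatedly; for a stochastic matrix the result
# converges to the steady-state vector.
def steady_state(m, x0, steps=100):
    x = x0
    for _ in range(steps):
        x = matvec(m, x)
    return x

M = [[0.9, 0.2],
     [0.1, 0.8]]  # each column sums to 1
x = steady_state(M, [1.0, 0.0])
# The limit satisfies M x = x: x is the eigenvector for eigenvalue 1.
# Here x approaches (2/3, 1/3).
```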
Adjustment to Transition Matrix
Google is said to use a damping factor p with a value of 0.85. Then, we retrieve our P_{nk} as follows:
\[
P_{nk} = 0.85
\begin{pmatrix}
0 & 0 & 0 & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{7} \\
1 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & \tfrac{1}{7} \\
0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{7} \\
0 & 0 & 0 & 0 & 0 & \tfrac{1}{4} & \tfrac{1}{7}
\end{pmatrix}
+ 0.15
\begin{pmatrix}
\tfrac{1}{7} & \cdots & \tfrac{1}{7} \\
\vdots & \ddots & \vdots \\
\tfrac{1}{7} & \cdots & \tfrac{1}{7}
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.02142 & 0.14285 \\
0.87142 & 0.02142 & 0.44642 & 0.02142 & 0.44642 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.14285 \\
0.02142 & 0.30476 & 0.44642 & 0.02142 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.23392 & 0.14285 \\
0.02142 & 0.30476 & 0.02142 & 0.02142 & 0.44642 & 0.02142 & 0.14285 \\
0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.02142 & 0.23392 & 0.14285
\end{pmatrix}
\]
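The damped matrix can be assembled programmatically (a sketch: S is the link matrix with its seventh column replaced by uniform 1/7 entries, following the slide, and J is the all-ones matrix in the formula P_nk = p·S + (1 − p)·(1/n)·J):

```python
p, n = 0.85, 7

# Link matrix with the seventh column replaced by uniform 1/7 entries.
S = [
    [0,   0,   0,   1/2, 0,   0,   1/7],
    [1,   0,   1/2, 0,   1/2, 1/4, 1/7],
    [0,   1/3, 0,   0,   0,   0,   1/7],
    [0,   1/3, 1/2, 0,   0,   1/4, 1/7],
    [0,   0,   0,   1/2, 0,   1/4, 1/7],
    [0,   1/3, 0,   0,   1/2, 0,   1/7],
    [0,   0,   0,   0,   0,   1/4, 1/7],
]

# Damping mixes S with the uniform matrix: every entry gets (1 - p)/n added.
G = [[p * S[i][j] + (1 - p) / n for j in range(n)] for i in range(n)]
# For example, G[0][3] = 0.85 * 1/2 + 0.15/7, about 0.44642 as on the slide.
```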
Final Rank
Starting from x_0 = (0, 1, 0, 0, 0, 0, 0)^T and iterating to n = 75:
\[
x_{75} = P_{nk}^{75}
\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\approx
\begin{pmatrix} 0.104631 \\ 0.253767 \\ 0.100953 \\ 0.177828 \\ 0.138598 \\ 0.159857 \\ 0.063021 \end{pmatrix}
\]
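As a check, the 75 iterations can be carried out in a short script (a sketch; S is copied from the adjusted link matrix on the previous slide):

```python
p, n = 0.85, 7
S = [
    [0,   0,   0,   1/2, 0,   0,   1/7],
    [1,   0,   1/2, 0,   1/2, 1/4, 1/7],
    [0,   1/3, 0,   0,   0,   0,   1/7],
    [0,   1/3, 1/2, 0,   0,   1/4, 1/7],
    [0,   0,   0,   1/2, 0,   1/4, 1/7],
    [0,   1/3, 0,   0,   1/2, 0,   1/7],
    [0,   0,   0,   0,   0,   1/4, 1/7],
]
G = [[p * S[i][j] + (1 - p) / n for j in range(n)] for i in range(n)]

# Apply the damped matrix 75 times to x0 = e2, the slide's starting vector.
x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(75):
    x = [sum(G[i][j] * x[j] for j in range(n)) for i in range(n)]

# x is approximately (0.1046, 0.2538, 0.1010, 0.1778, 0.1386, 0.1599, 0.0630):
# website 2 ranks highest and website 7 lowest.
```

Since the damped matrix is positive and column-stochastic, the iteration converges to the same ranking vector from any starting probability vector.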
Bibliography
1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval
2. Michael W. Berry, Zlatko Drmač, Elizabeth R. Jessup. Matrices, Vector Spaces, and Information Retrieval
3. Raluca Tanase, Remus Radu. The Mathematics of Web Search
4. M.W. Berry, S.T. Dumais, G.W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval
5. Howard Anton, Robert C. Busby. Contemporary Linear Algebra