
HITS and PageRank

Nov 18, 2014


Noisedj

A brief summary of the main web retrieval methods: HITS, PageRank and SALSA.

Introduction

The information retrieval systems seen previously, such as the VSM and LSI, cannot be used as they are for web pages, because the computational cost and the storage of the data would be enormous. There are essentially three systems that exploit the structure of the web:
• HITS (Hypertext Induced Topic Search)
• PageRank
• SALSA

HITS

Each page on the web is represented as a node, and the nodes are connected to one another by directed arcs, which represent the hyperlinks between pages.

In the figure below, six web pages can be identified, interconnected by directed arcs. The HITS method defines two types of nodes:
• Authorities: documents with incoming links
• Hubs: documents with outgoing links

HITS is based on a simple idea: "good hubs point to good authorities, and good authorities are pointed to by good hubs". For every node, HITS defines both an authority score and a hub score.

Fig. 2.1. Hyperlink structure of a 6-node sample web

Fig. 2.2. An authority node and a hub node


Given a node, we can define its authority score as x_i and its hub score as y_i. Let E be the set of arcs connecting the nodes of the web graph, and let e_ij denote the directed arc from node i to node j. Given an initial hub score y_i^(0) and authority score x_i^(0) for every page, HITS defines the successive scores as

x_i^(k) = Σ_{j : e_ji ∈ E} y_j^(k-1)   and   y_i^(k) = Σ_{j : e_ij ∈ E} x_j^(k),   for k = 1, 2, 3, ...

This equation can be written in matrix form by means of a particular adjacency matrix, denoted L, that represents the web graph:

L_ij = 1 if there is an edge from node i to node j, and L_ij = 0 otherwise.

For example, consider a small graph of this kind: a 4-node web in which page 1 links to pages 2 and 3, page 2 links to pages 1 and 3, page 3 links to pages 2 and 4, and page 4 links to page 2.

Its adjacency matrix L is the following:

             d1  d2  d3  d4
        d1 [  0   1   1   0 ]
        d2 [  1   0   1   0 ]
        d3 [  0   1   0   1 ]
        d4 [  0   1   0   0 ]

Having introduced the adjacency matrix, we can rewrite the score-update equations in matrix form as

x^(k) = L^T y^(k-1)   and   y^(k) = L x^(k).

This procedure is an iterative method for computing the hub and authority scores. The iterative algorithm is the following:
1. Initialize the vector y^(0) with all elements equal to 1.
2. Until convergence, do
       compute x^(k)
       compute y^(k)
       k = k + 1
       normalize x^(k) and y^(k)
From step 2 of the algorithm we know that

x^(k) = L^T y^(k-1)   and   y^(k) = L x^(k).



Through a substitution step, these two equations can be simplified to

x^(k) = L^T L x^(k-1)   and   y^(k) = L L^T y^(k-1).

These two equations define the iterative power method for computing the dominant (right) eigenvector of the two matrices L^T L and L L^T. The matrix L^T L is called the authority matrix, and the matrix L L^T is called the hub matrix. Both matrices are symmetric and positive semidefinite.
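As an illustration only (added to this summary, not part of it), the following Python sketch runs the normalized HITS iteration on the 4-node example above and checks that the resulting authority vector matches the dominant eigenvector of L^T L computed directly; the function name and the iteration count are arbitrary choices.

    import numpy as np

    # Adjacency matrix L of the 4-node example web.
    L = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 1, 0, 0]], dtype=float)

    def hits(L, iterations=100):
        """Power-method HITS iteration, normalizing the scores at every step."""
        y = np.ones(L.shape[0])        # y(0) = vector of all ones
        for _ in range(iterations):
            x = L.T @ y                # x(k) = L^T y(k-1)
            y = L @ x                  # y(k) = L x(k)
            x /= np.linalg.norm(x)     # normalization keeps the scores bounded
            y /= np.linalg.norm(y)
        return x, y

    x, y = hits(L)

    # The authority vector should agree with the dominant eigenvector of L^T L
    # (eigh returns eigenvalues in ascending order, so column -1 is dominant).
    _, vecs = np.linalg.eigh(L.T @ L)
    print(np.allclose(x, np.abs(vecs[:, -1])))   # should print True
    print("authority scores:", np.round(x, 3))
    print("hub scores:      ", np.round(y, 3))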

HITS implementation

Implementing the HITS algorithm involves the following steps:
1. Building the neighborhood graph N with respect to a query.
2. Computing the hub and authority scores for every page in the graph N.
3. Presenting two ranked lists, ordered by hub score and by authority score respectively.

Let us start from the first step. The graph of pages is generated by selecting all the documents containing terms that match those used in the query. There are various ways to build the term-document association for the web; one of the simplest consults the inverted term-document file, which might produce a structure of this kind:
• aardvark: term 1 - doc 3, doc 117, doc 3961
• ...
• aztec: term 10 - doc 15, doc 3, doc 101, doc 19, doc 1199, doc 673
• baby: term 11 - doc 56, doc 94, doc 31, doc 3
• ...
• zymurgy: term m - doc 223

For each term, the documents that use it are listed. Once the documents considered relevant have been identified, the graph is generated, taking into account the interconnections among them. The graph is then expanded by adding further nodes that point to, or are pointed to by, the nodes defined previously; this expansion is mainly used to mitigate the problem of synonymy. A maximum number of nodes that may be added to the base graph must be fixed before expanding, because without a limit the expansion can produce a very large graph. Once the graph is defined, the adjacency matrix L is built; the order of L is far smaller than the number of nodes in the graph of the entire web. Finally, the dominant eigenvectors of the authority and hub matrices are computed.
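A minimal sketch of the neighborhood-graph construction, assuming a toy in-memory inverted index and a toy link structure (all names and data below are illustrative, not taken from the summary), could look like this:

    # Toy inverted index: term -> documents containing it.
    inverted_index = {
        "aztec": [15, 3, 101, 19, 1199, 673],
        "baby": [56, 94, 31, 3],
    }

    # Toy link structure: document -> documents it links to.
    outlinks = {3: [15, 94], 15: [3], 56: [31], 94: [3], 101: [19]}
    inlinks = {}
    for src, dsts in outlinks.items():
        for dst in dsts:
            inlinks.setdefault(dst, []).append(src)

    def neighborhood_graph(query_terms, max_expand=100):
        # Step 1: every document containing a query term goes into N.
        N = set()
        for term in query_terms:
            N.update(inverted_index.get(term, []))
        # Step 2: expand N with pages pointing to, or pointed to by, pages in N,
        # capping how many neighbours each page may contribute.
        expansion = set()
        for doc in N:
            expansion.update(outlinks.get(doc, [])[:max_expand])
            expansion.update(inlinks.get(doc, [])[:max_expand])
        return N | expansion

    print(sorted(neighborhood_graph(["aztec", "baby"])))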



Convergence of HITS

As we saw above, the HITS algorithm is nothing more than the power method applied to the matrices L^T L and L L^T. Since L^T L and L L^T are symmetric, positive semidefinite and nonnegative, the power method (with normalization) converges; for the limit to be unique, the dominant eigenvalue must be simple, i.e. have algebraic multiplicity equal to 1.

Strengths and weaknesses of HITS

Advantages:
• A double ranking, one by hub score and one by authority score
• The hub and authority matrices are much smaller than the number of documents on the web

Disadvantages:
• Dependence on the query
• Susceptibility to spamming
• The topic drift problem

PageRank

In PageRank, each page is assigned a score related to its importance before any query is executed; in this way, once a query is run, the user can be shown a ranked list of pages related to the query terms. The importance of a page is determined by "votes", i.e. the links pointing to that page. The basic idea is that votes (links) from important sites should carry more weight than votes (links) from less important sites, and that the weight of a vote should be scaled by the number of outgoing links of its source.

Denoting by P a generic page and by B_P the set of pages that point to P, the rank r(P) of the page is given by

r(P) = Σ_{Q ∈ B_P} r(Q) / |Q|,

where |Q| is the number of outlinks of page Q.
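As a toy illustration (added here, with made-up data), the ranking can be refined directly from this definition, starting from the uniform initial ranking r_0(P_i) = 1/n:

    # Toy web: page -> pages it links to (illustrative data only).
    outlinks = {1: [2, 3], 2: [1, 3], 3: [2, 4], 4: [5, 6], 5: [3, 4], 6: [5]}
    pages = list(outlinks)

    # B_P: for every page, the pages pointing to it.
    inlinks = {p: [q for q in pages if p in outlinks[q]] for p in pages}

    # Repeatedly apply r_j(P) = sum over Q in B_P of r_{j-1}(Q) / |Q|.
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(50):
        r = {p: sum(r[q] / len(outlinks[q]) for q in inlinks[p]) for p in pages}

    print({p: round(score, 4) for p, score in r.items()})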


This is clearly a recursive definition: to compute the score at step j, the scores of the pages Q at step j-1 are needed, which makes the process iterative. The vector of the scores of all the pages, called the PageRank vector, is therefore

π_j^T = ( r_j(P_1), r_j(P_2), ..., r_j(P_n) ),


and it is computed iteratively as

π_j^T = π_{j-1}^T P,



where P is the matrix with

p_ij = 1/|P_i| if P_i links to P_j, and p_ij = 0 otherwise,

and |P_i| denotes the number of outlinks of page P_i.
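In matrix form the same computation can be sketched as follows (illustrative only), building the raw matrix P for the toy 6-page web used above and applying the power method:

    import numpy as np

    # Same toy web as before: page -> pages it links to.
    outlinks = {1: [2, 3], 2: [1, 3], 3: [2, 4], 4: [5, 6], 5: [3, 4], 6: [5]}
    n = len(outlinks)

    # Raw transition matrix: p_ij = 1/|P_i| if page i links to page j, else 0.
    P = np.zeros((n, n))
    for i, outs in outlinks.items():
        for j in outs:
            P[i - 1, j - 1] = 1.0 / len(outs)

    pi = np.full(n, 1.0 / n)      # initial ranking r_0(P_i) = 1/n
    for _ in range(100):
        pi = pi @ P               # power method: pi_j^T = pi_{j-1}^T P

    print(np.round(pi, 4))        # PageRank from the raw, unadjusted P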


Adjusting the matrix P

The matrix P obtained in this way can present a particular problem: one or more of its rows may sum to 0. This means that some nodes in the network have no outlinks; such nodes are known in the literature as "dangling nodes". The problem is solved by replacing each of these rows with the vector e^T/n, where n is the order of the matrix P.
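A minimal sketch of this repair, assuming P is held as a dense NumPy array (a web-scale implementation would of course use a sparse representation):

    import numpy as np

    def fix_dangling(P):
        """Replace every all-zero row of P with the uniform row e^T / n."""
        P_bar = P.copy()
        n = P_bar.shape[0]
        dangling = P_bar.sum(axis=1) == 0   # rows of pages with no outlinks
        P_bar[dangling] = np.ones(n) / n
        return P_bar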

The new matrix is a stochastic matrix, that is:
• the sum along each row is equal to 1
• its dominant eigenvalue is equal to 1

It follows that, if the iteration for computing the PageRank vector converges, it converges to the normalized left eigenvector π^T satisfying

π^T = π^T P,   π^T e = 1   (e is a column vector of all ones).


PageRank implementation

Since computing the PageRank vector reduces to finding the dominant left eigenvector of the matrix P or, equivalently, to solving the homogeneous linear system π^T (I - P) = 0 with π^T e = 1, determining it might seem a simple task. On the contrary, the size of the problem (Google's database contains billions of pages) severely limits the choice of algorithms that can be used efficiently to compute π^T. Moreover, even though it is stochastic, the matrix P̄ can be reducible, i.e. it can be rearranged so as to contain submatrices with all elements equal to 0, corresponding to sets of pages in which a surfer following links becomes trapped. The structure of the web makes the matrix P̄ almost certainly reducible, which means that it must be corrected further.

Google's founders initially chose to force the irreducibility of the matrix by perturbing it further, replacing P̄ with the convex combination

α P̄ + (1 - α) E,   where   E = e e^T / n

and α is a scalar parameter between 0 and 1. The resulting matrix, often called the Google matrix, is stochastic and irreducible, and therefore has a unique stationary distribution π^T.
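Putting the adjustments together, a sketch of the full computation might look like the following; the value alpha = 0.85 is a commonly quoted damping factor used here only as an illustrative default, and the function name is arbitrary.

    import numpy as np

    def pagerank(P_raw, alpha=0.85, iterations=100):
        n = P_raw.shape[0]
        # Step 1: fix dangling rows so the matrix becomes stochastic (P bar).
        P_bar = P_raw.copy()
        dangling = P_bar.sum(axis=1) == 0
        P_bar[dangling] = np.ones(n) / n
        # Step 2: force irreducibility with the uniform perturbation E = e e^T / n.
        G = alpha * P_bar + (1 - alpha) * np.ones((n, n)) / n
        # Step 3: power method on the resulting Google matrix.
        pi = np.full(n, 1.0 / n)
        for _ in range(iterations):
            pi = pi @ G
        return pi / pi.sum()              # normalize so that pi^T e = 1

Applied to the toy 6-page matrix P built earlier, pagerank(P) returns the damped PageRank vector for the six example pages.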


Google later decided to use a different, more flexible perturbation matrix:

E = evT

where vT is called the "personalization vector" and satisfies vT > 0. Using this particular matrix allows Google's engineers to intervene and "correct" the PageRank vector, for example to promote a sponsored page or to penalize a page that tries to manipulate its own score.

PageRank Implementation

Note that PageRank assigns each web page a score based on importance, not on relevance. Its implementation involves two fundamental steps:

1. Determine the set of nodes (pages) that contain the terms entered in the query.
2. Reorder the returned nodes according to their PageRank score and present them to the user.
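A minimal sketch of these two steps in Python is shown below. The inverted_index and pagerank structures are illustrative assumptions, not something described in the notes; PageRank itself is precomputed, since it does not depend on the query.

def answer_query(query_terms, inverted_index, pagerank):
    # Step 1: relevancy set = pages containing all the query terms
    # (inverted_index maps a term to the set of pages containing it)
    relevant = set.intersection(*(inverted_index[t] for t in query_terms))
    # Step 2: sort the relevancy set by the precomputed, query-independent PageRank score
    return sorted(relevant, key=lambda page: pagerank[page], reverse=True)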

In particular, to compute the PageRank vector:

I. Specify a value for the parameter α;
II. set π_0^T = e^T/n, where n is the order of P̄;
III. iterate

π_{k+1}^T = α π_k^T P̄ + (1 − α) v^T

until the desired degree of convergence is reached.
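A minimal NumPy sketch of this iteration is given below; it assumes P̄ is supplied as a dense row-stochastic array, and the names pagerank, Pbar, v and tol are illustrative, not part of the original notes.

import numpy as np

def pagerank(Pbar, alpha=0.85, v=None, tol=1e-8, max_iter=1000):
    """Power iteration  pi_{k+1}^T = alpha * pi_k^T * Pbar + (1 - alpha) * v^T.

    Pbar : row-stochastic matrix (zero rows already replaced by e^T/n).
    v    : personalization vector (uniform teleportation if None).
    """
    n = Pbar.shape[0]
    if v is None:
        v = np.full(n, 1.0 / n)
    pi = np.full(n, 1.0 / n)                   # pi_0^T = e^T / n
    for _ in range(max_iter):
        pi_next = alpha * (pi @ Pbar) + (1.0 - alpha) * v
        if np.abs(pi_next - pi).sum() < tol:   # desired degree of convergence
            return pi_next
        pi = pi_next
    return pi

Because P̄ is row-stochastic and v sums to 1, every iterate remains a probability vector; in practice the product π_k^T P̄ is carried out as a sparse matrix-vector multiplication, which is what makes the power method attractive at Web scale.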

PageRank Convergence

Since the matrix

¯̄P = αP̄ + (1 − α)E

is stochastic and irreducible, it has a single dominant eigenvalue equal to 1, while all its other eigenvalues have modulus strictly less than 1. This guarantees that the power method converges to the dominant eigenvector, the PageRank vector πT. The crucial issue is therefore not convergence itself, but the speed at which the method converges, especially given the enormous amount of computation required, since PageRank operates on the entire Web. The rate of convergence of the power method is governed by the rate at which

λ2^k → 0,

or rather, since

λk = α μk   for k = 2, 3, …, n,

the rate of convergence is thus governed by

(α μ2)^k → 0.

The structure of the Web, moreover, forces μ2 = 1; as a consequence, the rate of convergence is tied solely to how quickly

α^k → 0.

In particular, for values of α close to 0 the method converges faster, but the web structure used to determine the page scores is less realistic; conversely, for values of α close to 1 the method converges more slowly, but it uses a more realistic web structure.
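As a worked example taken from the source paper: with the value α = 0.85 originally reported by Google, λ2 ≈ 0.85, and since (0.85)^114 < 10^-8, roughly 114 power iterations already give an accuracy on the order of 10^-8 for the PageRank scores.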


PageRank Accuracy

Since the vector πT is a probability vector, each of its components is a number between 0 and 1. Assuming the vector has dimension 1 × 4 billion, an accuracy of at least 10^-9 would be required to distinguish the individual components of the vector, especially those in its tail.
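As an illustration drawn from the source paper, consecutive values in the tail of the ranked vector may look like .0000015316, .0000015312, .0000015210: they differ only around the ninth or tenth decimal place, which is why an accuracy on the order of 10^-9 is needed to tell them apart (although, for the subset of pages relevant to a single query, such extreme accuracy is usually unnecessary).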

PageRank Updating

Google has stated that the PageRank vector is recomputed roughly once every 2-3 weeks. In the meantime, the structure of the web clearly changes considerably. The PageRank vector from a previous period therefore cannot simply be reused to compute the updated one, because pages, and the links between them, are added or modified.

This updating problem is very important for Google's engineers and is still an area of active research.

Strengths and Weaknesses of PageRank

Advantages:
• Query independence in the assignment of scores
• Resistance to spamming, since PageRank is a global measure
• Flexibility of personalization through the vector vT

Disadvantages:
• Topic drift, since the score assigned to the pages is based on their importance, not on their relevance to the information the user is looking for.


SALSA

SALSA was created with the aim of combining the strengths of HITS and PageRank, in order to overcome the main limitations of both Web retrieval techniques. The basic idea is to assign two scores, a hub score and an authority score, to each node, using Markov chains.

SALSA Implementation

First, as in HITS, a neighborhood graph N associated with the query is built. A bipartite graph G is then constructed with

Vh = { hub nodes (nodes of N with outdegree > 0) }
Va = { authority nodes (nodes of N with indegree > 0) }
E = { directed edges of N }

Note that a node of N may belong to both Vh and Va. In the running example (the neighborhood graph of Figure 3.2):

Vh = {1, 2, 3, 6, 10}, Va = {1, 3, 5, 6}

[Fig. 5.1. G: bipartite graph for SALSA; every directed edge of N becomes an undirected edge between the hub side and the authority side]

Starting from the graph, one first builds the adjacency matrix L (the same adjacency matrix of N used by HITS); then, from L, two further matrices Lr and Lc are constructed as follows:

Lr = L with each element lij divided by the sum of its i-th row (nonzero rows only)
Lc = L with each element lij divided by the sum of its j-th column (nonzero columns only)

For the running example:

L (rows and columns indexed by nodes 1, 2, 3, 5, 6, 10):

          1     2     3     5     6    10
    1     0     0     1     0     1     0
    2     1     0     0     0     0     0
    3     0     0     0     0     1     0
    5     0     0     0     0     0     0
    6     0     0     1     1     0     0
   10     0     0     0     0     1     0

Lr (rows of L normalized by their row sums):

          1     2     3     5     6    10
    1     0     0    1/2    0    1/2    0
    2     1     0     0     0     0     0
    3     0     0     0     0     1     0
    5     0     0     0     0     0     0
    6     0     0    1/2   1/2    0     0
   10     0     0     0     0     1     0

Lc (columns of L normalized by their column sums):

          1     2     3     5     6    10
    1     0     0    1/2    0    1/3    0
    2     1     0     0     0     0     0
    3     0     0     0     0    1/3    0
    5     0     0     0     0     0     0
    6     0     0    1/2    1     0     0
   10     0     0     0     0    1/3    0


At this point the matrices LrLcT and LcTLr are formed:

LrLcT:

          1     2     3     5     6    10
    1    5/12   0    1/6    0    1/4   1/6
    2     0     1     0     0     0     0
    3    1/3    0    1/3    0     0    1/3
    5     0     0     0     0     0     0
    6    1/4    0     0     0    3/4    0
   10    1/3    0    1/3    0     0    1/3

LcTLr:

          1     2     3     5     6    10
    1     1     0     0     0     0     0
    2     0     0     0     0     0     0
    3     0     0    1/2   1/4   1/4    0
    5     0     0    1/2   1/2    0     0
    6     0     0    1/6    0    5/6    0
   10     0     0     0     0     0     0

and, finally, one obtains the matrices:

H, the hub matrix = LrLcT with its zero rows and columns removed
A, the authority matrix = LcTLr with its zero rows and columns removed

H (on the hub nodes 1, 2, 3, 6, 10):

          1     2     3     6    10
    1    5/12   0    1/6   1/4   1/6
    2     0     1     0     0     0
    3    1/3    0    1/3    0    1/3
    6    1/4    0     0    3/4    0
   10    1/3    0    1/3    0    1/3

A (on the authority nodes 1, 3, 5, 6):

          1     3     5     6
    1     1     0     0     0
    3     0    1/2   1/4   1/4
    5     0    1/2   1/2    0
    6     0    1/6    0    5/6
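A minimal NumPy sketch of this construction, under the assumption that L is available as a dense array (the function and variable names are ours, for illustration only):

import numpy as np

def salsa_matrices(L):
    """Build SALSA's hub matrix H and authority matrix A from the
    adjacency matrix L of the neighborhood graph."""
    L = np.asarray(L, dtype=float)
    row_sums = L.sum(axis=1, keepdims=True)
    col_sums = L.sum(axis=0, keepdims=True)
    # normalize only the nonzero rows / columns
    Lr = np.divide(L, row_sums, out=np.zeros_like(L), where=row_sums != 0)
    Lc = np.divide(L, col_sums, out=np.zeros_like(L), where=col_sums != 0)
    hubs = np.flatnonzero(L.sum(axis=1) > 0)     # outdegree > 0
    auths = np.flatnonzero(L.sum(axis=0) > 0)    # indegree  > 0
    H = (Lr @ Lc.T)[np.ix_(hubs, hubs)]          # nonzero rows/cols of Lr Lc^T
    A = (Lc.T @ Lr)[np.ix_(auths, auths)]        # nonzero rows/cols of Lc^T Lr
    return H, A, hubs, auths

# The adjacency matrix L of the running example (nodes 1, 2, 3, 5, 6, 10)
L = [[0, 0, 1, 0, 1, 0],
     [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0],
     [0, 0, 0, 0, 1, 0]]
H, A, hubs, auths = salsa_matrices(L)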

To compute the two score vectors, one looks again at the structure of G.

• If G is connected

the stationary vectors πhT of H and πaT of A are computed, containing the hub and authority scores, respectively.

• If G is NOT connected

then H and A contain multiple irreducible Markov chains, and the stationary vectors are determined for each sub-chain separately. In particular, in the example considered, the following sub-chains can be identified in H and A:

From the matrix H: C = { 2 }, D = { 1, 3, 6, 10 }
From the matrix A: E = { 1 }, F = { 3, 5, 6 }

The stationary vectors for each chain are the following:

πhT(C) = ( 2 : 1 )
πhT(D) = ( 1 : 1/3,  3 : 1/6,  6 : 1/3,  10 : 1/6 )

πaT(E) = ( 1 : 1 )
πaT(F) = ( 3 : 1/3,  5 : 1/6,  6 : 1/2 )


The final step consists of merging the authority vectors of the individual components into a single global authority vector, and likewise the hub vectors into a single global hub vector. The mechanism for merging them is the following:

I. Each component of a vector is multiplied by x/n, where x is the number of nodes in the component considered and n is the total number of nodes on the hub (or authority) side.
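For the running example, C contains 1 of the 5 hub nodes and D contains 4 of the 5, so their vectors are weighted by 1/5 and 4/5; similarly, E and F are weighted by 1/4 and 3/4. The resulting global vectors are:

πhT = ( 1 : 4/5 · 1/3,  2 : 1/5 · 1,  3 : 4/5 · 1/6,  6 : 4/5 · 1/3,  10 : 4/5 · 1/6 )
    = ( .2667  .2  .1333  .2667  .1333 )

πaT = ( 1 : 1/4 · 1,  3 : 3/4 · 1/3,  5 : 3/4 · 1/6,  6 : 3/4 · 1/2 )
    = ( .25  .25  .125  .375 )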

N.B. When computing the stationary vectors, having a disconnected graph G is actually desirable in SALSA, since it reduces the computational cost of the calculation (the vectors to be computed are obviously small compared with one for the whole graph).
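A small sketch of how the per-component stationary vectors could be computed and pasted together, assuming NumPy and that each component's chain is aperiodic (as in the example); the component index sets could be obtained, for instance, with scipy.sparse.csgraph.connected_components. The names below are illustrative only:

import numpy as np

def stationary(P, tol=1e-10, max_iter=10000):
    """Stationary vector of an irreducible, aperiodic stochastic matrix,
    computed by the power method."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        pi_next = pi @ P
        if np.abs(pi_next - pi).sum() < tol:
            break
        pi = pi_next
    return pi_next

def global_scores(M, components):
    """Paste the per-component stationary vectors of the SALSA hub (or
    authority) matrix M into one global score vector, weighting each
    component by x / n (x = component size, n = number of nodes of M)."""
    n = M.shape[0]
    scores = np.zeros(n)
    for comp in components:
        comp = np.asarray(comp)
        scores[comp] = (len(comp) / n) * stationary(M[np.ix_(comp, comp)])
    return scores

For the example above, the components of H correspond to the index sets [1] (node 2) and [0, 2, 3, 4] (nodes 1, 3, 6, 10), so global_scores(H, [[1], [0, 2, 3, 4]]) reproduces the global hub vector (.2667, .2, .1333, .2667, .1333) in the node order 1, 2, 3, 6, 10.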

Strengths and Weaknesses of SALSA

Advantages:
• It does not suffer from the topic drift problem
• It is less susceptible to spamming, since the hub and authority scores are less tightly coupled than in HITS
• It keeps the dual (hub and authority) ranking of HITS, something PageRank does not provide
• It is computationally cheap if the graph G is disconnected

Disadvantages:
• It is query-dependent: at query time the neighborhood graph N must be built and two Markov chains must be solved
• If the neighborhood graph is reducible, the resulting stationary vectors may not be unique (they may depend on the starting vector), so the method may fail to converge to a unique ranking.