Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection
Luca Becchetti (1), Carlos Castillo (1), Debora Donato (1), Stefano Leonardi (1) and Ricardo Baeza-Yates (2)
1. Università di Roma "La Sapienza", Rome, Italy
2. Yahoo! Research, Barcelona, Spain and Santiago, Chile
August 20th, 2006
Outline: Motivation; Spam pages characterization; Truncated PageRank; Counting supporters; Experiments; Conclusions
Using Rank Propagation for Spam Detection (WebKDD 2006)
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm that computes a scoring vector W of the form

$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^{t}$$

There are many choices for damping(t), including a simple linear function that is as good as PageRank in practice. PageRank itself corresponds to the exponential damping

$$\mathrm{damping}(t) = (1-\alpha)\,\alpha^{t}$$
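As an illustration, the series can be approximated numerically by cutting it off after a finite number of terms. The sketch below is only one possible reading, not the authors' implementation: the function names, the `max_t` cutoff, the adjacency-list input, the choice of applying the series to the uniform vector 1/N to obtain a score per node, and the handling of nodes without out-links (their mass is simply dropped) are all assumptions made for the example.

```python
def functional_ranking(out_links, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * P^t, truncated at max_t terms.

    out_links[src] lists the destinations of src (nodes numbered 0..N-1).
    Hypothetical helper: names and defaults are illustrative only.
    """
    n = len(out_links)
    walk = [1.0 / n] * n            # uniform start carries the 1/N factor
    scores = [0.0] * n
    for t in range(max_t + 1):
        weight = damping(t)
        scores = [s + weight * x for s, x in zip(scores, walk)]
        # One step along the links: walk <- walk * P (row-normalized matrix).
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                nxt[dest] += walk[src] / len(dests)
        walk = nxt
    return scores


def pagerank_damping(alpha):
    """Exponential damping (1 - alpha) * alpha^t."""
    return lambda t: (1 - alpha) * alpha ** t


if __name__ == "__main__":
    cycle = [[1], [2], [0]]         # toy graph: 0 -> 1 -> 2 -> 0
    print(functional_ranking(cycle, pagerank_damping(0.85)))
```

Passing a different `damping` callable changes the ranking without touching the propagation loop, which is the point of the functional formulation.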
Truncated PageRank
Reduce the direct contribution of the first levels of links:

$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^{t} & t > T \end{cases}$$

✓ No extra reading of the graph after PageRank
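The slide does not say how the constant C is chosen. One natural choice, assumed here, is the value that makes the damping weights sum to one, as they do for ordinary PageRank. Under that assumption a geometric series gives:

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

With T = -1 this reduces to C = 1 - α, the PageRank damping.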
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation

for i : 1 ... N do                        {Initialization}
    R[i] ← (1 − α) / ((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else                                  {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do                  {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src] / outdegree(src)
        end for
    end for
    for i : 1 ... N do                    {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then              {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
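The pseudocode above can be transcribed almost line by line. The following is a minimal sketch under stated assumptions, not the authors' code: the graph is given as adjacency lists, and since the slide only says "while not converged", the stopping rule used here (stop once the rank mass still in circulation falls below `tol`, or after `max_iter` rounds) is a choice made for the example.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    """Truncated PageRank; T = -1 gives ordinary PageRank.

    out_links[src] lists the destinations of src (nodes numbered 0..N-1).
    tol and max_iter are hypothetical parameters for the stopping rule.
    """
    n = len(out_links)
    # Initialization: (1 - alpha) / (alpha^(T+1) * N) for every node.
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    while distance <= max_iter:
        aux = [0.0] * n
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:                          # add to ranking value
            score = [s + x for s, x in zip(score, r)]
        distance += 1
        if sum(r) < tol:                          # assumed convergence test
            break
    return score


if __name__ == "__main__":
    graph = [[2], [2], []]   # node 2 is supported only by direct in-links
    print(truncated_pagerank(graph, T=-1))
    print(truncated_pagerank(graph, T=1))
```

In the toy graph, node 2 receives all of its rank from direct in-links, so truncating the first level removes its score entirely.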
Truncated PageRank vs PageRank
[Figure: two plots on logarithmic axes of Truncated PageRank (y-axis, 10^-9 to 10^-3) against normal PageRank (x-axis, 10^-8 to 10^-4), one panel for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4. The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011). The bits are propagated along the links using the "OR" operation; at the target page, the bits set are counted to estimate the supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits

for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do                 {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do                  {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do                     {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
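The propagation step of this algorithm can be sketched with Python integers used as bit vectors. This is an illustration under assumptions rather than the authors' implementation: the initial vectors are passed in by the caller instead of being produced by INIT, the ESTIMATE step is left out, and the names `propagate_bits` and `count_ones` are hypothetical.

```python
def propagate_bits(out_links, init_bits, d):
    """Run d rounds of OR-propagation along the links.

    out_links[src] lists the destinations of src; init_bits[node] is an int
    whose binary digits are that node's initial bit vector.
    Following the pseudocode literally, Aux starts from zero in every round,
    so after d rounds a node holds the OR of the initial vectors of the nodes
    that reach it by a walk of exactly d links.
    """
    v = list(init_bits)
    for _ in range(d):
        aux = [0] * len(v)
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]
        v = aux
    return v


def count_ones(bits):
    """Number of bits set, the input to the estimation step."""
    return bin(bits).count("1")


if __name__ == "__main__":
    links = [[2], [2], [3], []]                     # 0 -> 2, 1 -> 2, 2 -> 3
    init = [0b100010, 0b110000, 0b000110, 0b000011]
    final = propagate_bits(links, init, 2)
    print([format(x, "06b") for x in final], count_ones(final[3]))
```

Because the state per node is only k bits and each round is one sequential pass over the links, the cost per round does not depend on how many supporters a node has.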
Our estimator
Initialize every bit to one with probability ε.
By the independence of the i-th components X_i, we have

$$P[X_i = 1] = 1 - (1-\varepsilon)^{\mathrm{neighbors}(node)}$$

Estimator:

$$\mathrm{neighbors}(node) = \log_{(1-\varepsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$$

Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
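The estimator is a change of logarithm base, and the saturation problem shows up directly when it is written as code. The sketch below is a hypothetical helper (the name `estimate_neighbors` and the choice to return infinity for a saturated vector are assumptions of the example, not part of the slides).

```python
import math


def estimate_neighbors(ones, k, eps):
    """Evaluate log_{1-eps}(1 - ones/k) for a vector of k bits.

    ones: number of bits set at the node; eps: probability that a bit
    was initialized to one. Returns math.inf when every bit is set,
    because the vector then carries no usable information.
    """
    if ones <= 0:
        return 0.0
    if ones >= k:
        return math.inf
    return math.log(1 - ones / k) / math.log(1 - eps)


if __name__ == "__main__":
    k, eps = 1000, 0.1
    expected_ones = k * (1 - (1 - eps) ** 10)   # a node with 10 supporters
    print(estimate_neighbors(expected_ones, k, eps))
    print(estimate_neighbors(k, k, eps))        # saturated vector
```

With ε = 0.1 a node with a few hundred supporters is already expected to have nearly all of its bits set, which is the problem stated on the slide.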
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get

$$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$$

Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
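The search for the transition can be sketched as a loop over halving values of ε. This is one plausible reading, not the authors' procedure: `ones_for_eps` is a hypothetical callback standing in for a full propagation run with that ε, and the final step, which applies the formula log_{1-ε}(1 − ones/k) at the first ε that falls below the threshold, is an assumption of this example.

```python
import math


def adaptive_estimate(ones_for_eps, k, max_rounds=32):
    """Halve eps until fewer than (1 - 1/e) * k bits are set, then estimate.

    ones_for_eps(eps) must return the number of bits set at the node when
    the bits were initialized with probability eps (hypothetical callback).
    Returns None if no transition is seen within max_rounds.
    """
    threshold = (1 - 1 / math.e) * k
    eps = 0.5
    for _ in range(max_rounds):
        ones = ones_for_eps(eps)
        if ones < threshold:
            # Transition found: eps is now of the order of 1/neighbors(node).
            if ones <= 0:
                return 0.0
            return math.log(1 - ones / k) / math.log(1 - eps)
        eps /= 2
    return None


if __name__ == "__main__":
    k, supporters = 64, 1000
    # Stand-in for real runs: the expected number of ones for this node.
    def expected(eps):
        return k * (1 - (1 - eps) ** supporters)
    print(adaptive_estimate(expected, k))
```

Each halving of ε corresponds to one more run of the propagation, which is why the number of iterations matters for the cost of the method.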
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis, 5 to 20), with one curve for each distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, and Supporters at 2, ..., 4.

We measured:

$$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$

$$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$

and the two types of errors in spam classification:

$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$

$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
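The four measures follow directly from the counts of a confusion matrix. The helper below is a small sketch with hypothetical names; the counts in the usage example are made up for illustration and are not results from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures from host counts.

    tp: spam hosts classified as spam      fp: normal hosts classified as spam
    tn: normal hosts classified as normal  fn: spam hosts classified as normal
    Assumes every denominator is non-zero.
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in spam_metrics(tp=50, fp=10, tn=490, fn=50).items():
        print(f"{name}: {value:.3f}")
```

Recall and the false negative rate share the same denominator, so they always add up to one.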
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.

Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.

Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                 Spam class          False    False
(pruning with M = 5)        Prec.    Recall     Pos.     Neg.
TrustRank                   0.82     0.50       2.1%     50%
Trunc. PageRank t = 2       0.85     0.50       1.6%     50%
Trunc. PageRank t = 3       0.84     0.47       1.6%     53%
Trunc. PageRank t = 4       0.79     0.45       2.2%     55%
Est. Supporters d = 2       0.78     0.60       3.2%     40%
Est. Supporters d = 3       0.83     0.64       2.4%     36%
Est. Supporters d = 4       0.86     0.64       2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers                 Spam class          False    False
(pruning with M = 30)       Prec.    Recall     Pos.     Neg.
TrustRank                   0.80     0.49       2.3%     51%
Trunc. PageRank t = 2       0.82     0.43       1.8%     57%
Trunc. PageRank t = 3       0.81     0.42       1.8%     58%
Trunc. PageRank t = 4       0.77     0.43       2.4%     57%
Est. Supporters d = 2       0.76     0.52       3.1%     48%
Est. Supporters d = 3       0.83     0.57       2.1%     43%
Est. Supporters d = 4       0.80     0.57       2.6%     43%
Combined classifier
                      Spam class             False    False
Pruning     Rules     Precision    Recall    Pos.     Neg.
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
\[ \text{Precision} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of hosts classified as spam}} \]
\[ \text{Recall} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of spam hosts}} \]
and the two types of errors in spam classification:
\[ \text{False positive rate} = \frac{\text{\# of normal hosts classified as spam}}{\text{\# of normal hosts}} \]
\[ \text{False negative rate} = \frac{\text{\# of spam hosts classified as normal}}{\text{\# of spam hosts}} \]
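As a small illustration of the four measures, the sketch below computes them from the counts of a confusion matrix. The function name `spam_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures from host counts.

    tp: spam hosts classified as spam
    fp: normal hosts classified as spam
    tn: normal hosts classified as normal
    fn: spam hosts classified as normal
    """
    return {
        "precision": tp / (tp + fp),            # among hosts classified as spam
        "recall": tp / (tp + fn),               # among all spam hosts
        "false_positive_rate": fp / (fp + tn),  # among all normal hosts
        "false_negative_rate": fn / (fn + tp),  # among all spam hosts
    }


# Hypothetical counts, for illustration only.
metrics = spam_metrics(tp=50, fp=10, tn=890, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and the false negative rate share the denominator (the number of spam hosts), so they always add up to one.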
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers              Spam class        False   False
(pruning with M = 5)     Prec.   Recall    Pos.    Neg.
TrustRank                0.82    0.50      2.1%    50%
Trunc. PageRank t = 2    0.85    0.50      1.6%    50%
Trunc. PageRank t = 3    0.84    0.47      1.6%    53%
Trunc. PageRank t = 4    0.79    0.45      2.2%    55%
Est. Supporters d = 2    0.78    0.60      3.2%    40%
Est. Supporters d = 3    0.83    0.64      2.4%    36%
Est. Supporters d = 4    0.86    0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers              Spam class        False   False
(pruning with M = 30)    Prec.   Recall    Pos.    Neg.
TrustRank                0.80    0.49      2.3%    51%
Trunc. PageRank t = 2    0.82    0.43      1.8%    57%
Trunc. PageRank t = 3    0.81    0.42      1.8%    58%
Trunc. PageRank t = 4    0.77    0.43      2.4%    57%
Est. Supporters d = 2    0.76    0.52      3.1%    48%
Est. Supporters d = 3    0.83    0.57      2.1%    43%
Est. Supporters d = 4    0.80    0.57      2.6%    43%
Combined classifier
                     Spam class            False   False
Pruning    Rules     Precision   Recall    Pos.    Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al. 2006] is a link-based ranking algorithm to compute a scoring vector W of the form
\[ W = \sum_{t=0}^{\infty} \frac{\text{damping}(t)}{N} P^t \]
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice. PageRank itself corresponds to the exponential damping
\[ \text{damping}(t) = (1-\alpha)\,\alpha^t \]
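The sum above can be approximated by truncating it after a fixed number of terms. The sketch below is one possible way to do that in plain Python, not the authors' implementation. The adjacency-list representation `out_links`, the function names, and the cutoff `num_terms` are assumptions made for illustration, and nodes without out-links simply lose their mass.

```python
def functional_ranking(out_links, damping, num_terms=100):
    """Approximate W = sum_t damping(t)/N * P^t with the first num_terms terms.

    out_links[i] lists the nodes that node i links to (assumed representation).
    """
    n = len(out_links)
    mass = [1.0 / n] * n   # the row vector (1/N, ..., 1/N) times P^t, for t = 0
    score = [0.0] * n
    for t in range(num_terms):
        for i in range(n):
            score[i] += damping(t) * mass[i]
        # Multiply by the row-normalized matrix P: follow links once.
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                nxt[dest] += mass[src] / len(dests)
        mass = nxt
    return score


def pagerank_damping(t, alpha=0.85):
    """Exponential damping (1 - alpha) * alpha^t, which yields PageRank."""
    return (1 - alpha) * alpha ** t


# Tiny 3-node cycle 0 -> 1 -> 2 -> 0: every node gets the same score.
print(functional_ranking([[1], [2], [0]], pagerank_damping))
```

Passing a different `damping` function changes the ranking without touching the propagation loop, which is the point of the functional formulation.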
Truncated PageRank
Reduce the direct contribution of the first levels of links:
\[ \text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases} \]
Advantage: no extra reading of the graph after PageRank.
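The slide does not spell out the constant C. One natural reading, assuming damping(t) is meant to sum to one over all t (as it does for PageRank), is the following; it agrees with the initialization R[i] = (1 − α)/(α^{T+1} N) used in the general algorithm.

```latex
\[
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
\]
```

For T = −1 this gives C = 1 − α, which recovers the PageRank damping (1 − α)α^t.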
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do  {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else  {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do  {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do  {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then  {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
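The pseudocode above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the graph is an adjacency list `out_links`, nodes without out-links simply lose their mass, and "converged" is taken to mean that the largest propagated value drops below a tolerance `tol` (the slide does not define the convergence test).

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    """Truncated PageRank; T = -1 gives normal PageRank.

    out_links[i] lists the nodes that node i links to (assumed representation).
    """
    n = len(out_links)
    # Initialization: scaling by alpha^(T+1) normalizes the truncated damping.
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = list(r) if T < 0 else [0.0] * n
    for distance in range(1, max_iter + 1):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:                          # add to ranking value
            for i in range(n):
                score[i] += r[i]
        if max(r) < tol:                          # assumed convergence test
            break
    return score


# Node 0 is supported only by its direct in-neighbors 1 and 2.
graph = [[], [0], [0]]
print(truncated_pagerank(graph, T=-1)[0])  # normal PageRank counts them
print(truncated_pagerank(graph, T=1)[0])   # truncation at T = 1 ignores them
```

In the toy graph, all of node 0's score comes from supporters at distance 1, so truncating the first level removes it entirely. This is the effect the technique relies on for pages whose rank comes mostly from nearby links.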
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (y-axis, 10^-9 to 10^-3) against normal PageRank (x-axis, 10^-8 to 10^-4), one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: every page holds a bit vector (for example 100010); bits are propagated along links toward the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al. 2002], based on probabilistic counting [Flajolet and Martin 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do  {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do  {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do  {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
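The propagation step can be sketched in Python by storing each bit vector as an integer and using the bitwise OR operator. This is an illustrative sketch only: `out_links` and `propagate_bits` are assumed names, and INIT and ESTIMATE are replaced here by a deliberately simple exact scheme (one distinct bit per node, then a count of set bits) so that the result is easy to check by hand. As in the pseudocode, V is replaced by Aux at every step, so after d steps a node holds the bits of the nodes that reach it by walks of exactly d links.

```python
def propagate_bits(out_links, d, init_vectors):
    """Propagate bit vectors (stored as ints) along links for d steps."""
    v = list(init_vectors)
    for _ in range(d):                            # iteration step
        aux = [0] * len(v)
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]               # Aux[dest] OR V[src, .]
        v = aux
    return v


# Chain 0 -> 1 -> 2 plus 3 -> 2. One distinct bit per node (exact, for
# illustration; the probabilistic version uses k random bits per node).
graph = [[1], [2], [], [2]]
init = [1 << node for node in range(len(graph))]
vectors = propagate_bits(graph, d=1, init_vectors=init)
supporters = [bin(bits).count("1") for bits in vectors]
print(supporters)
```

With one bit per node the count is exact but needs N bits per node. The probabilistic version trades that exactness for a fixed k bits per node.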
Our estimator
Initialize all bits to one with probability ε.
By the independence of the X_i's (the i-th components), we have
\[ P[X_i = 1] = 1 - (1-\varepsilon)^{\text{neighbors}(node)} \]
Estimator:
\[ \text{neighbors}(node) = \log_{(1-\varepsilon)}\left(1 - \frac{\text{ones}(node)}{k}\right) \]
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
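The estimator is a direct inversion of the probability formula, which a few lines of Python make concrete. The function name `estimate_neighbors` is hypothetical, and returning infinity for a saturated vector is a choice made here to expose the problem described above, not something stated on the slide.

```python
import math


def estimate_neighbors(ones, k, eps):
    """Invert ones/k = 1 - (1 - eps)^neighbors to recover neighbors."""
    if ones >= k:
        # All k bits are set: the vector is saturated and carries no
        # information beyond "many neighbors" (assumed handling).
        return math.inf
    return math.log(1 - ones / k) / math.log(1 - eps)


k, eps, true_neighbors = 64, 0.1, 10
expected_ones = k * (1 - (1 - eps) ** true_neighbors)
print(estimate_neighbors(expected_ones, k, eps))  # close to 10
print(estimate_neighbors(k, k, eps))              # saturated: inf
```

When ones(node) is 0 the estimate is 0, and when it equals k the logarithm is undefined, which is why a single fixed ε cannot serve nodes whose neighborhood sizes differ by orders of magnitude.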
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
\[ \text{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k \]
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
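The search for the transition can be sketched as below. The slide does not say how the final estimate is read off the transition, so returning 1/ε at the first round that falls below the threshold is an assumption made for this sketch; the names `adaptive_estimate` and `ones_per_round` are hypothetical as well.

```python
import math


def adaptive_estimate(ones_per_round, k):
    """Find the first eps = 1/2, 1/4, 1/8, ... with fewer than (1 - 1/e)k ones.

    ones_per_round[j] is ones(node) observed with eps = 2**-(j + 1).
    Returns 1/eps at the transition (assumed read-out), or None if every
    round stays above the threshold and smaller eps values are needed.
    """
    threshold = (1 - 1 / math.e) * k
    for j, ones in enumerate(ones_per_round):
        if ones < threshold:
            return 2 ** (j + 1)
    return None


# Expected bit counts for a node with 100 neighbors and k = 64 bits.
k, neighbors = 64, 100
rounds = [k * (1 - (1 - 2.0 ** -j) ** neighbors) for j in range(1, 11)]
print(adaptive_estimate(rounds, k))
```

Because the count crosses (1 − 1/e)k when ε is close to 1/neighbors(node), halving ε at each round locates the neighborhood size to within roughly a factor of two in this simplified read-out.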
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Convergence
[Figure: fraction of nodes with estimates versus iteration, one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
Less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2...4, Supporters at 2...4
We measured
$$\mathrm{Precision} = \frac{\#\ \text{of spam hosts classified as spam}}{\#\ \text{of hosts classified as spam}}$$
$$\mathrm{Recall} = \frac{\#\ \text{of spam hosts classified as spam}}{\#\ \text{of spam hosts}}$$
and the two types of errors in spam classification
$$\text{False positive rate} = \frac{\#\ \text{of normal hosts classified as spam}}{\#\ \text{of normal hosts}}$$
$$\text{False negative rate} = \frac{\#\ \text{of spam hosts classified as normal}}{\#\ \text{of spam hosts}}$$
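The four measures can be computed directly from the confusion counts of a classifier. The sketch below treats spam as the positive class; the function name `spam_metrics` and its parameter names are hypothetical and not part of the original slides.

```python
def spam_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the four measures with spam as the positive class.

    tp: spam hosts classified as spam      fp: normal hosts classified as spam
    tn: normal hosts classified as normal  fn: spam hosts classified as normal
    """
    return {
        "precision": tp / (tp + fp),            # over hosts classified as spam
        "recall": tp / (tp + fn),               # over all spam hosts
        "false_positive_rate": fp / (fp + tn),  # over all normal hosts
        "false_negative_rate": fn / (fn + tp),  # over all spam hosts
    }
```

Recall and the false negative rate share the same denominator, so they always add up to 1.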
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M=5)
Classifiers                  Spam class        False   False
(pruning with M = 5)         Prec.   Recall    Pos.    Neg.
TrustRank                    0.82    0.50      2.1%    50%
Trunc. PageRank t = 2        0.85    0.50      1.6%    50%
Trunc. PageRank t = 3        0.84    0.47      1.6%    53%
Trunc. PageRank t = 4        0.79    0.45      2.2%    55%
Est. Supporters d = 2        0.78    0.60      3.2%    40%
Est. Supporters d = 3        0.83    0.64      2.4%    36%
Est. Supporters d = 4        0.86    0.64      2.0%    36%
Comparison of single-technique classifiers (M=30)
Classifiers                  Spam class        False   False
(pruning with M = 30)        Prec.   Recall    Pos.    Neg.
TrustRank                    0.80    0.49      2.3%    51%
Trunc. PageRank t = 2        0.82    0.43      1.8%    57%
Trunc. PageRank t = 3        0.81    0.42      1.8%    58%
Trunc. PageRank t = 4        0.77    0.43      2.4%    57%
Est. Supporters d = 2        0.76    0.52      3.1%    48%
Est. Supporters d = 3        0.83    0.57      2.1%    43%
Est. Supporters d = 4        0.80    0.57      2.6%    43%
Combined classifier
                             Spam class            False   False
Pruning       Rules          Precision   Recall    Pos.    Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al 2006] is a link-based ranking algorithm to compute a scoring vector W of the form
$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
PageRank itself corresponds to the exponential damping
$$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$$
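The sum can be evaluated numerically by propagating a uniform vector through the graph and weighting each step by damping(t). The following is a minimal sketch under several assumptions that are not in the slides: the graph is a dictionary `out_links` with nodes numbered 0 to N − 1, every node has at least one out-link (so P has no empty rows), and the infinite sum is cut off at a hypothetical `t_max`.

```python
from typing import Callable, Dict, List


def functional_ranking(out_links: Dict[int, List[int]],
                       damping: Callable[[int], float],
                       t_max: int = 200) -> List[float]:
    """Approximate W = sum_t damping(t) / N * P^t, truncated at t_max."""
    n = len(out_links)
    walk = [1.0 / n] * n                      # the t = 0 term: (1/N) * identity
    score = [damping(0) * w for w in walk]
    for t in range(1, t_max + 1):
        nxt = [0.0] * n
        for src, dests in out_links.items():
            share = walk[src] / len(dests)    # row-normalized: split equally
            for dest in dests:
                nxt[dest] += share
        walk = nxt                            # now holds the t-th power term
        weight = damping(t)
        for i in range(n):
            score[i] += weight * walk[i]
    return score
```

Passing `lambda t: (1 - alpha) * alpha ** t` as `damping` reproduces the PageRank weighting; any other non-negative function of t can be substituted without changing the loop.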
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
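The slide does not define the constant C. Assuming, as for PageRank, that the damping weights are required to sum to 1, C follows from a geometric series; the result agrees with the initialization value (1 − α)/(α^(T+1) N) used in the General algorithm slide.

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

For T = −1 this gives C = 1 − α, which is the PageRank damping (1 − α)α^t.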
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ -1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
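The pseudocode translates almost line by line into Python. The sketch below is one possible rendering, not the authors' code: the dictionary representation `out_links` (nodes numbered 0 to N − 1), the stopping rule based on `tol`, and the cap `max_iter` are assumptions, since the slide only says "while not converged".

```python
from typing import Dict, List


def truncated_pagerank(out_links: Dict[int, List[int]], alpha: float = 0.85,
                       T: int = -1, tol: float = 1e-12,
                       max_iter: int = 1000) -> List[float]:
    """Truncated PageRank; T = -1 yields normal PageRank."""
    n = len(out_links)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N)
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    for _ in range(max_iter):
        aux = [0.0] * n
        for src, dests in out_links.items():      # follow links in the graph
            if dests:
                share = r[src] / len(dests)
                for dest in dests:
                    aux[dest] += share
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:                          # add to ranking value
            for i in range(n):
                score[i] += r[i]
        # Assumed convergence test: stop once the propagated mass is negligible.
        if distance > T and max(r) < tol:
            break
        distance += 1
    return score
```

A node with no in-links receives a score of exactly 0 for any T ≥ 0, because its only contribution in normal PageRank is the t = 0 term that truncation removes.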
Truncated PageRank vs PageRank
[Figure: two log-scale plots of Truncated PageRank (T = 1 and T = 4) against normal PageRank]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: bit vectors are propagated along links toward a target page using the "OR" operation; the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al 2002], based on probabilistic counting [Flajolet and Martin 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
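A compact way to try the propagation step is to store each node's k bits in a Python integer and use the bitwise OR operator. The sketch below follows the pseudocode as written, so after each step a node keeps only the bits that arrived along its in-links. Several details are assumptions: INIT sets each bit independently with probability ε, ESTIMATE applies the formula from the "Our estimator" slide, and the name `count_supporters` and the dictionary `out_links` (nodes numbered 0 to N − 1) are hypothetical.

```python
import math
import random
from typing import Dict, List


def count_supporters(out_links: Dict[int, List[int]], d: int, k: int,
                     eps: float, rng: random.Random) -> List[float]:
    """Estimate supporters by OR-propagating random k-bit vectors for d steps."""
    n = len(out_links)
    # INIT: each of the k bits of each node is set with probability eps.
    v = [sum(1 << b for b in range(k) if rng.random() < eps) for _ in range(n)]
    for _ in range(d):                      # iteration step
        aux = [0] * n
        for src, dests in out_links.items():
            for dest in dests:              # follow links in the graph
                aux[dest] |= v[src]
        v = aux
    estimates = []
    for bits in v:                          # ESTIMATE: invert 1 - (1 - eps)^n
        ones = bin(bits).count("1")
        if ones == k:
            estimates.append(math.inf)      # saturated vector, no finite estimate
        else:
            estimates.append(math.log(1 - ones / k) / math.log(1 - eps))
    return estimates
```

Each iteration reads every link once, so d iterations cost d sequential passes over the graph regardless of how many supporters a node has.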
Adaptive estimator

If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get

$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$

Adaptive estimation

Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
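The adaptive search can be sketched as follows. This is a simplified illustration under stated assumptions: the names `ones_after_or` and `adaptive_estimate` are hypothetical, the true neighbor count is used only to simulate the bit vectors, and the estimate returned is simply 1/ε at the first value of ε that falls below the threshold, so it is a power of two.

```python
import math
import random


def ones_after_or(neighbors: int, k: int, eps: float, rng: random.Random) -> int:
    """Bits set in the OR of `neighbors` random k-bit vectors (bit = 1 w.p. eps).

    Each output bit is sampled directly with probability 1 - (1 - eps)^neighbors.
    """
    p_one = 1.0 - (1.0 - eps) ** neighbors
    return sum(rng.random() < p_one for _ in range(k))


def adaptive_estimate(neighbors: int, k: int, rng: random.Random,
                      max_rounds: int = 40) -> float:
    """Halve eps until fewer than (1 - 1/e) k bits are set; return 1 / eps."""
    threshold = (1.0 - 1.0 / math.e) * k
    eps = 0.5
    for _ in range(max_rounds):
        if ones_after_or(neighbors, k, eps, rng) < threshold:
            return 1.0 / eps
        eps /= 2.0
    return math.inf


if __name__ == "__main__":
    print(adaptive_estimate(1000, 4096, random.Random(0)))
```

Because ε is halved at each round, the number of rounds grows only logarithmically with the largest neighbor count to be estimated.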
Convergence

[Figure: fraction of nodes with estimates (0 to 100) versus iteration (5 to 20), one curve for each distance d = 1, ..., 8]

15 iterations for estimating the neighbors at distance 4 or less.

Less than 25 iterations for all distances up to 8.
Automatic classifier

We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 ... 4, Supporters at 2 ... 4.

We measured:

$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$

$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$

and the two types of errors in spam classification:

$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$

$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
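The four measures can be computed from a list of labeled hosts as in the sketch below. The function name `spam_metrics` and the input layout (two parallel lists of booleans) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def _ratio(num: int, den: int) -> float:
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def spam_metrics(is_spam: Sequence[bool],
                 predicted_spam: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall and the two error rates for the spam class."""
    pairs = list(zip(is_spam, predicted_spam))
    tp = sum(1 for y, p in pairs if y and p)          # spam classified as spam
    fn = sum(1 for y, p in pairs if y and not p)      # spam classified as normal
    fp = sum(1 for y, p in pairs if not y and p)      # normal classified as spam
    tn = sum(1 for y, p in pairs if not y and not p)  # normal classified as normal
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "false_positive_rate": _ratio(fp, fp + tn),
        "false_negative_rate": _ratio(fn, tp + fn),
    }


if __name__ == "__main__":
    truth = [True, True, True, True, False, False, False, False, False, False]
    guess = [True, True, False, False, True, False, False, False, False, False]
    print(spam_metrics(truth, guess))
```

Recall and false negative rate share the same denominator, so they always sum to one.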
Single-technique classifier

Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.

Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.

Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)

Classifiers                  Spam class          False    False
(pruning with M = 5)         Prec.    Recall     pos.     neg.

TrustRank                    0.82     0.50       2.1%     50%

Trunc. PageRank t = 2        0.85     0.50       1.6%     50%
Trunc. PageRank t = 3        0.84     0.47       1.6%     53%
Trunc. PageRank t = 4        0.79     0.45       2.2%     55%

Est. Supporters d = 2        0.78     0.60       3.2%     40%
Est. Supporters d = 3        0.83     0.64       2.4%     36%
Est. Supporters d = 4        0.86     0.64       2.0%     36%
Comparison of single-technique classifier (M=30)

Classifiers                  Spam class          False    False
(pruning with M = 30)        Prec.    Recall     pos.     neg.

TrustRank                    0.80     0.49       2.3%     51%

Trunc. PageRank t = 2        0.82     0.43       1.8%     57%
Trunc. PageRank t = 3        0.81     0.42       1.8%     58%
Trunc. PageRank t = 4        0.77     0.43       2.4%     57%

Est. Supporters d = 2        0.76     0.52       3.1%     48%
Est. Supporters d = 3        0.83     0.57       2.1%     43%
Est. Supporters d = 4        0.80     0.57       2.6%     43%
Combined classifier

Pruning    Rules    Spam class             False    False
                    Precision    Recall    pos.     neg.
General functional ranking

Let P be the row-normalized version of the citation matrix of a graph G = (V, E).

A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm that computes a scoring vector W of the form

$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$

There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.

PageRank itself corresponds to exponential damping:

$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$
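The sum above can be evaluated term by term by pushing a uniform vector along the links. The sketch below is an illustration under stated assumptions: the name `functional_ranking` is hypothetical, the graph is given as a dictionary of out-link lists, the infinite sum is cut off at `max_t` terms, and nodes without out-links simply lose their mass (dangling nodes are not handled in any special way).

```python
from typing import Callable, Dict, List


def functional_ranking(out_links: Dict[int, List[int]],
                       damping: Callable[[int], float],
                       max_t: int = 200) -> Dict[int, float]:
    """Approximate W = sum_t damping(t)/N * P^t with a finite number of terms."""
    nodes = list(out_links)
    n = len(nodes)
    # walk[v] is entry v of the uniform vector (1/N) pushed t steps along links.
    walk = {v: 1.0 / n for v in nodes}
    score = {v: damping(0) * walk[v] for v in nodes}
    for t in range(1, max_t + 1):
        nxt = {v: 0.0 for v in nodes}
        for src, dests in out_links.items():
            if not dests:
                continue  # assumption: mass at dangling nodes is dropped
            share = walk[src] / len(dests)
            for dest in dests:
                nxt[dest] += share
        walk = nxt
        for v in nodes:
            score[v] += damping(t) * walk[v]
    return score


if __name__ == "__main__":
    alpha = 0.85
    cycle = {0: [1], 1: [2], 2: [0]}
    print(functional_ranking(cycle, lambda t: (1 - alpha) * alpha ** t))
```

Passing a different `damping` function changes the ranking without touching the propagation loop, which is the point of the functional-ranking formulation.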
Truncated PageRank

Reduce the direct contribution of the first levels of links:

$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$

✓ No extra reading of the graph after PageRank
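The slide does not give the value of the constant C. Assuming, as for the PageRank damping (1 − α)α^t, that the damping values are normalized to sum to one over all t, C follows from a geometric series:

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha}
  = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

With T = −1 this gives C = 1 − α, which is the ordinary PageRank damping.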
General algorithm

Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation

  for i : 1 ... N do   {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
      Score[i] ← 0
    else   {Calculate normal PageRank}
      Score[i] ← R[i]
    end if
  end for
  distance = 1
  while not converged do
    Aux ← 0
    for src : 1 ... N do   {Follow links in the graph}
      for all links from src to dest do
        Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
      end for
    end for
    for i : 1 ... N do   {Apply damping factor α}
      R[i] ← Aux[i] × α
      if distance > T then   {Add to ranking value}
        Score[i] ← Score[i] + R[i]
      end if
    end for
    distance = distance + 1
  end while
  return Score
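The pseudocode above translates almost line by line into Python. This is a sketch, not the authors' code: the name `truncated_pagerank`, the dictionary graph format, and the stopping rule (stop once every entry of R falls below `tol`) are assumptions, since the slide only says "while not converged".

```python
from typing import Dict, List


def truncated_pagerank(out_links: Dict[int, List[int]], alpha: float = 0.85,
                       T: int = 2, tol: float = 1e-12,
                       max_iter: int = 1000) -> Dict[int, float]:
    """Truncated PageRank; T = -1 gives ordinary PageRank."""
    nodes = list(out_links)
    n = len(nodes)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N).
    r = {v: (1.0 - alpha) / (alpha ** (T + 1) * n) for v in nodes}
    score = {v: (r[v] if T < 0 else 0.0) for v in nodes}
    distance = 1
    while distance <= max_iter:
        aux = {v: 0.0 for v in nodes}
        for src, dests in out_links.items():  # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = {v: alpha * aux[v] for v in nodes}  # apply damping factor alpha
        if distance > T:  # contributions at distance <= T are ignored
            for v in nodes:
                score[v] += r[v]
            # Assumed convergence test: remaining contributions are negligible.
            if max(r.values()) < tol:
                break
        distance += 1
    return score


if __name__ == "__main__":
    graph = {0: [1], 1: [0], 2: [0]}
    print(truncated_pagerank(graph, T=-1))
    print(truncated_pagerank(graph, T=0))
```

In the small example, node 2 has no in-links: with T = −1 it keeps only its own initial share, and with T = 0 its score is exactly zero.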
Truncated PageRank vs PageRank

[Figure: two log-log plots of Truncated PageRank (vertical axis, 10^-9 to 10^-3) against normal PageRank (horizontal axis, 10^-8 to 10^-4), one panel for T = 1 and one for T = 4]

Comparing PageRank and Truncated PageRank with T = 1 and T = 4. The correlation is high and decreases as more levels are truncated.
Probabilistic counting

[Figure: bit vectors (for example 100010, 110000, 000110, 000011) are propagated along links towards a target page using the "OR" operation; the bits set at the target page are counted to estimate its supporters]

Improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm

Require: N: number of nodes, d: distance, k: bits

  for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
  end for
  for distance : 1 ... d do   {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do   {Follow links in the graph}
      for all links from src to dest do
        Aux[dest] ← Aux[dest] OR V[src,·]
      end for
    end for
    V ← Aux
  end for
  for node : 1 ... N do   {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node,·])
  end for
  return Supporters
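The propagation loop can be sketched in Python with one integer per node used as a k-bit vector. The name `propagate_bits` and the graph format are assumptions; INIT is taken to set each bit with probability ε, and the ESTIMATE step is replaced by a plain count of the bits set. As in the pseudocode, Aux starts from zero at every step, so a node's own bits are not carried over.

```python
import random
from typing import Dict, List


def propagate_bits(out_links: Dict[int, List[int]], d: int, k: int, eps: float,
                   rng: random.Random) -> Dict[int, int]:
    """Return, per node, the number of bits set after d OR-propagation steps."""
    # INIT: each of the k bits of each node is 1 with probability eps.
    v = {node: sum((rng.random() < eps) << bit for bit in range(k))
         for node in out_links}
    for _ in range(d):  # iteration step
        aux = {node: 0 for node in out_links}
        for src, dests in out_links.items():  # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]
        v = aux
    # Stand-in for ESTIMATE: count the ones in each bit vector.
    return {node: bin(bits).count("1") for node, bits in v.items()}


if __name__ == "__main__":
    chain = {0: [1], 1: [2], 2: []}
    print(propagate_bits(chain, d=1, k=8, eps=1.0, rng=random.Random(0)))
```

Each step reads every link once, so d steps cost d sequential passes over the graph regardless of k, as long as a bit vector fits in a machine word or a small array.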
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Our estimator
Initialize all bits to one with probability ε.
By the independence of the components X_i (the i-th bit of the vector), we have

$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$

Estimator:

$\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$

Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
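The estimator can be read as the inversion of the expected fraction of bits set. A short derivation using only the quantities on the slide, where n abbreviates neighbors(node); the numeric values ε = 0.01 and k = 64 are illustrative assumptions, not values from the talk:

```latex
\begin{align*}
\mathbb{E}[\mathrm{ones}(node)] &= k \, P[X_i = 1] = k \left(1 - (1-\epsilon)^{n}\right) \\
\frac{\mathrm{ones}(node)}{k} \approx 1 - (1-\epsilon)^{n}
  \;&\Longrightarrow\;
  n \approx \log_{(1-\epsilon)}\!\left(1 - \frac{\mathrm{ones}(node)}{k}\right) \\[4pt]
% Saturation at both ends for a fixed epsilon:
n = 10^{4},\ \epsilon = 0.01 &: \quad (1-\epsilon)^{n} = 0.99^{10000} \approx e^{-100.5},
  \text{ so almost surely } \mathrm{ones}(node) = k \\
n = 1,\ \epsilon = 0.01,\ k = 64 &: \quad P[\mathrm{ones}(node) = 0] = 0.99^{64} \approx 0.53
\end{align*}
```

In the first case the logarithm diverges, and in the second the estimate is 0 about half of the time, which is the problem stated on the slide.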
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get

$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$

Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
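The transition search can be sketched as follows. This is a coarse illustration under stated assumptions rather than the authors' procedure: it takes the observed ones(node) for ε = 1/2, 1/4, 1/8, ... as input and returns 1/ε at the first round that falls below the threshold, which locates neighbors(node) only to within a factor of two. The names `adaptive_estimate` and `expected_ones` are hypothetical.

```python
import math


def expected_ones(n, k, eps):
    """Expected number of bits set for a node with n supporters."""
    return k * (1 - (1 - eps) ** n)


def adaptive_estimate(ones_per_round, k):
    """ones_per_round[j - 1] is ones(node) observed with eps = 2**-j.
    Returns 2**j for the first round below (1 - 1/e) * k, or None if
    no round falls below the threshold."""
    threshold = (1 - 1 / math.e) * k
    for j, ones in enumerate(ones_per_round, start=1):
        if ones < threshold:
            return 2 ** j
    return None
```

With k = 64 and a node that has 100 supporters, the expected counts cross the threshold between ε = 1/64 and ε = 1/128, so feeding the rounded expected counts into the sketch returns 128.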
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis, 5 to 20), one curve for each distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Fewer than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, and Supporters at 2, ..., 4.

We measured:

$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$

$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$

and the two types of errors in spam classification:

$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$

$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
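The four measures can be computed directly from the counts of a confusion matrix. A minimal sketch follows; the function name and the argument names (`tp`, `fp`, `fn`, `tn`, with "positive" meaning "classified as spam") are assumptions for illustration, and the counts in the usage note are made up.

```python
def spam_metrics(tp, fp, fn, tn):
    """tp: spam hosts classified as spam, fp: normal hosts classified as spam,
    fn: spam hosts classified as normal, tn: normal hosts classified as normal."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }
```

For instance, `spam_metrics(tp=50, fp=10, fn=50, tn=490)` gives a recall of 0.5 and a false positive rate of 0.02. Recall and the false negative rate always sum to one, since both are fractions of the spam hosts.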
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.

Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.

Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M=5)

Classifiers                  Spam class           False   False
(pruning with M = 5)         Prec.     Recall     Pos.    Neg.
TrustRank                    0.82      0.50       2.1%    50%
Trunc. PageRank t = 2        0.85      0.50       1.6%    50%
Trunc. PageRank t = 3        0.84      0.47       1.6%    53%
Trunc. PageRank t = 4        0.79      0.45       2.2%    55%
Est. Supporters d = 2        0.78      0.60       3.2%    40%
Est. Supporters d = 3        0.83      0.64       2.4%    36%
Est. Supporters d = 4        0.86      0.64       2.0%    36%
Comparison of single-technique classifiers (M=30)

Classifiers                  Spam class           False   False
(pruning with M = 30)        Prec.     Recall     Pos.    Neg.
TrustRank                    0.80      0.49       2.3%    51%
Trunc. PageRank t = 2        0.82      0.43       1.8%    57%
Trunc. PageRank t = 3        0.81      0.42       1.8%    58%
Trunc. PageRank t = 4        0.77      0.43       2.4%    57%
Est. Supporters d = 2        0.76      0.52       3.1%    48%
Est. Supporters d = 3        0.83      0.57       2.1%    43%
Est. Supporters d = 4        0.80      0.57       2.6%    43%
Combined classifier
                             Spam class           False   False
Pruning    Rules             Precision  Recall    Pos.    Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm to compute a scoring vector W of the form

$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$

There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
PageRank itself corresponds to the exponential damping function

$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$
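The series can be evaluated term by term by propagating a uniform vector along the links and weighting each step with damping(t). The sketch below is an illustration under stated assumptions, not the authors' code: the graph is given as an adjacency list `out_links`, the infinite sum is cut off at a hypothetical `max_t`, and pages without out-links simply lose their mass.

```python
def functional_rank(out_links, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * P^t, applied to the all-ones
    row vector, by truncating the sum at t = max_t."""
    n = len(out_links)
    x = [1.0 / n] * n                      # the t = 0 term: (1/N) * 1
    w = [damping(0) * xi for xi in x]
    for t in range(1, max_t + 1):
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                nxt[dest] += x[src] / len(dests)   # one multiplication by P
        x = nxt
        w = [wi + damping(t) * xi for wi, xi in zip(w, x)]
    return w
```

On the three-page cycle `[[1], [2], [0]]` with `damping = lambda t: (1 - 0.85) * 0.85 ** t`, every page receives a score of about 1/3, as expected from symmetry.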
Truncated PageRank
Reduce the direct contribution of the first levels of links:

$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$

✓ No extra reading of the graph after PageRank
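The slide does not define C. Assuming it is a normalization constant chosen so that the damping values sum to one, its value follows from a geometric series:

```latex
\begin{align*}
\sum_{t=T+1}^{\infty} C\,\alpha^{t} = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
\end{align*}
```

Under this assumption, T = −1 gives C = 1 − α, which recovers the PageRank damping function (1 − α)α^t, and the factor (1 − α)/α^{T+1} is the one that appears in the initialization step of the Truncated PageRank algorithm.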
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do                             {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else                                       {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do                       {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do                         {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then                   {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
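The pseudocode above translates almost line by line into Python. The sketch below makes two assumptions that the slide leaves open: the graph is an adjacency list `out_links`, and "converged" means that every entry of R has dropped below a hypothetical tolerance `tol` (with `max_iter` as a safety bound).

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    n = len(out_links)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N)
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    # T = -1 computes normal PageRank, so the t = 0 term is kept
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    while distance <= max_iter:
        aux = [0.0] * n
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:                          # add to ranking value
            score = [s + ri for s, ri in zip(score, r)]
        distance += 1
        if max(r) < tol:                          # assumed convergence test
            break
    return score
```

On the three-page cycle `[[1], [2], [0]]`, both `T=-1` and `T=2` give every page a score of about 1/3, since the initialization factor compensates for the truncated terms.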
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (y-axis, 10^-9 to 10^-3) against normal PageRank (x-axis, 10^-8 to 10^-4), one panel for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Probabilistic counting
[Figure: bit vectors (e.g. 100010, 110000, 000110, 000011) are propagated along links towards the target page using the "OR" operation; the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 … N, bit : 1 … k do
    INIT(node, bit)
end for
for distance : 1 … d do    ▷ Iteration step
    Aux ← 0^k
    for src : 1 … N do    ▷ Follow links in the graph
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 … N do    ▷ Estimate supporters
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
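The propagation step can be sketched in Python as follows. This is an illustrative sketch under assumptions: each node's k bits are packed into one integer, INIT sets every bit independently with a probability `epsilon`, and the helper `ones` only counts set bits, standing in for the ESTIMATE step that the pseudocode leaves abstract. The function and parameter names are hypothetical.

```python
import random

def propagate_bits(out_links, d, k, epsilon, seed=0):
    """Return one k-bit vector (as an int) per node after d OR-propagation steps."""
    rng = random.Random(seed)
    n = len(out_links)
    # INIT(node, bit): assumed here to set each bit with probability epsilon
    v = [sum(1 << b for b in range(k) if rng.random() < epsilon)
         for _ in range(n)]
    for _ in range(d):                             # iteration step
        aux = [0] * n                              # Aux <- 0^k for every node
        for src, dests in enumerate(out_links):    # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]                # bitwise OR of the k bits
        v = aux
    return v

def ones(bits):
    """Number of bits set in a packed bit vector."""
    return bin(bits).count("1")
```

Packing the bits into an integer makes each link traversal a single OR, which is the property that lets the method run in d sequential passes over the graph.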
Our estimator
Initialize every bit to one with probability ε.
By the independence of the components X_i (the i-th bits), we have
\[ P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)} \]
Estimator:
\[ \mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right) \]
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that, for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
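The estimator can be written as a short Python function. This is a sketch only: the function name and the choice to return infinity for a saturated bit vector are assumptions made here to show the problem described above, not part of the slides.

```python
import math

def estimate_neighbors(ones, k, epsilon):
    """Invert P[X_i = 1] = 1 - (1 - epsilon)**n using the observed fraction ones / k."""
    if ones >= k:
        # Saturated vector: every bit is set, so the estimate is unbounded
        return math.inf
    # log base (1 - epsilon) of (1 - ones / k)
    return math.log(1.0 - ones / k) / math.log(1.0 - epsilon)
```

For example, with k = 1000 and ε = 0.01, a node with about 634 bits set is estimated to have roughly 100 neighbors. A node with all 1000 bits set gives no usable estimate, and a node with no bits set is estimated at 0, which is the weakness of a single fixed ε.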
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
\[ \mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k \]
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, … and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
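One way to read the transition rule is sketched below in Python. It assumes the counts of set bits for a node have already been collected for ε = 1/2, 1/4, 1/8, … and returns 1/ε at the first ε where the count falls below (1 − 1/e)k. Using 1/ε as the estimate is an interpretation of the slide, and the function name and input layout are hypothetical.

```python
import math

def adaptive_estimate(ones_per_epsilon, k):
    """ones_per_epsilon[j] is ones(node) observed with epsilon = 1 / 2**(j + 1)."""
    threshold = (1.0 - 1.0 / math.e) * k
    for j, ones in enumerate(ones_per_epsilon):
        if ones < threshold:
            # First epsilon at which fewer than (1 - 1/e) k bits are set
            return 2 ** (j + 1)
    # No transition observed: smaller values of epsilon would be needed
    return None
```

Because ε is halved at each step, an estimate read off this way is only accurate to within a factor of about two, which fits the goal of telling apart supporter counts that differ by orders of magnitude.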
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis, 5 to 20), one curve per distance d = 1, …, 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
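We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
\[ \text{Precision} = \frac{\text{number of spam hosts classified as spam}}{\text{number of hosts classified as spam}} \]
\[ \text{Recall} = \frac{\text{number of spam hosts classified as spam}}{\text{number of spam hosts}} \]
and the two types of errors in spam classification:
\[ \text{False positive rate} = \frac{\text{number of normal hosts classified as spam}}{\text{number of normal hosts}} \]
\[ \text{False negative rate} = \frac{\text{number of spam hosts classified as normal}}{\text{number of spam hosts}} \]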
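The four measures can be computed from labelled hosts as in the following Python sketch. The function name, the boolean-list inputs, and returning NaN when a denominator is zero are illustrative assumptions.

```python
def spam_metrics(is_spam, predicted_spam):
    """Both arguments are equal-length lists of booleans, one entry per host."""
    pairs = list(zip(is_spam, predicted_spam))
    tp = sum(1 for y, p in pairs if y and p)          # spam classified as spam
    fp = sum(1 for y, p in pairs if not y and p)      # normal classified as spam
    fn = sum(1 for y, p in pairs if y and not p)      # spam classified as normal
    tn = sum(1 for y, p in pairs if not y and not p)  # normal classified as normal

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, tp + fn),
    }
```

Recall and the false negative rate share the same denominator, the number of spam hosts, so they always sum to one.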
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers              Spam class          False    False
(pruning with M = 5)     Prec.    Recall     pos.     neg.
TrustRank                0.82     0.50       2.1%     50%
Trunc. PageRank t = 2    0.85     0.50       1.6%     50%
Trunc. PageRank t = 3    0.84     0.47       1.6%     53%
Trunc. PageRank t = 4    0.79     0.45       2.2%     55%
Est. Supporters d = 2    0.78     0.60       3.2%     40%
Est. Supporters d = 3    0.83     0.64       2.4%     36%
Est. Supporters d = 4    0.86     0.64       2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers              Spam class          False    False
(pruning with M = 30)    Prec.    Recall     pos.     neg.
TrustRank                0.80     0.49       2.3%     51%
Trunc. PageRank t = 2    0.82     0.43       1.8%     57%
Trunc. PageRank t = 3    0.81     0.42       1.8%     58%
Trunc. PageRank t = 4    0.77     0.43       2.4%     57%
Est. Supporters d = 2    0.76     0.52       3.1%     48%
Est. Supporters d = 3    0.83     0.57       2.1%     43%
Est. Supporters d = 4    0.80     0.57       2.6%     43%
Combined classifier
                     Spam class             False    False
Pruning    Rules     Precision    Recall    pos.     neg.
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Truncated PageRank vs PageRank
[Figure: two scatter plots on logarithmic axes, Truncated PageRank against normal PageRank, one panel for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4. The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011); bits are propagated along links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do   {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do   {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do   {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
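The propagation step of this algorithm can be sketched with Python integers used as bit vectors. The names `propagate_bits` and `count_ones` and the adjacency-list input are assumptions for illustration; INIT and ESTIMATE are left to the caller, who supplies the initial bit vectors and interprets the resulting counts.

```python
def propagate_bits(out_links, init_bits, d):
    """Propagate k-bit vectors along links for d iterations using OR.

    out_links[src] lists the destinations of src; init_bits[node] is an
    int whose binary digits are that node's initial bit vector.
    """
    v = list(init_bits)
    for _ in range(d):
        # As in the pseudocode, Aux starts from zero, so a node keeps
        # only the bits it receives in this step.
        aux = [0] * len(v)
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]
        v = aux
    return v


def count_ones(bits):
    """Number of bits set in the vector, the input to the estimator."""
    return bin(bits).count("1")


# Pages 0 and 1 both link to page 2.
links = [[2], [2], []]
result = propagate_bits(links, [0b011, 0b110, 0b000], d=1)
```

Because OR is idempotent, a supporter reached through several paths is counted once, which is what makes the bit vectors usable as a distinct-count sketch.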
Our estimator
Initialize all bits to one with probability ε.
By the independence of the i-th components X_i, we have
$$ P[X_i = 1] = 1 - (1 - \epsilon)^{neighbors(node)} $$
Estimator:
$$ neighbors(node) = \log_{(1-\epsilon)}\left(1 - \frac{ones(node)}{k}\right) $$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
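The estimator follows from replacing P[X_i = 1] by the observed fraction ones(node)/k and solving for neighbors(node). A small sketch of that inversion is shown here; the function name `estimate_neighbors` and the handling of the two degenerate cases are assumptions, chosen to make the saturation problem visible.

```python
import math


def estimate_neighbors(ones, k, eps):
    """Invert ones/k = 1 - (1 - eps)^n for n.

    Returns math.inf when every bit is set (the estimator saturates) and
    0.0 when no bit is set.
    """
    if ones >= k:
        return math.inf
    if ones == 0:
        return 0.0
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)


half_set = estimate_neighbors(4, 8, 0.5)       # half of the bits are one
saturated = estimate_neighbors(8, 8, 0.5)      # all bits are one
```

With ε = 1/2 a node needs only a handful of supporters to set all k bits, which is why a single fixed ε cannot serve nodes whose supporter counts differ by orders of magnitude.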
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$ ones(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k $$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ..., and look for the transition from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
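One possible reading of this procedure is sketched below. It is an illustration only: the helper `expected_ones` replaces the random bit vectors by their expected number of ones, `adaptive_estimate` and `max_steps` are hypothetical names, and returning 1/ε at the first ε below the threshold is an assumption about how the transition is turned into an estimate.

```python
import math


def expected_ones(n_neighbors, k, eps):
    """Expected number of ones among k bits for a node with n_neighbors supporters."""
    return k * (1.0 - (1.0 - eps) ** n_neighbors)


def adaptive_estimate(ones_for_eps, k, max_steps=40):
    """Try eps = 1/2, 1/4, 1/8, ... and return 1/eps at the first eps for
    which the number of ones falls below (1 - 1/e) * k.

    ones_for_eps is a callable eps -> observed number of ones.
    Returns None if no transition is seen within max_steps halvings.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for step in range(1, max_steps + 1):
        eps = 2.0 ** -step
        if ones_for_eps(eps) < threshold:
            return 1.0 / eps
    return None


k = 64
estimate = adaptive_estimate(lambda eps: expected_ones(100, k, eps), k)
```

Since ε is halved at each step, an estimate obtained this way is only accurate to within a factor of about two; in practice the bit counts are random, so the transition point also fluctuates.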
Convergence
[Figure: fraction of nodes with estimates against iteration, one curve for each distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less; fewer than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 ... 4, Supporters at 2 ... 4.
We measured
$$ \text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}} $$
$$ \text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}} $$
and the two types of errors in spam classification:
$$ \text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}} $$
$$ \text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}} $$
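These four measures can all be computed from the counts of a confusion matrix. The sketch below uses a hypothetical function `spam_metrics` and made-up counts purely to show the arithmetic; none of the numbers come from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures from confusion-matrix counts.

    tp: spam hosts classified as spam, fp: normal hosts classified as spam,
    tn: normal hosts classified as normal, fn: spam hosts classified as normal.
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }


# Illustrative counts only: 100 spam hosts and 900 normal hosts.
m = spam_metrics(tp=50, fp=10, tn=890, fn=50)
```

Recall and the false negative rate share the same denominator, so they always sum to one.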
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M = 5)
Classifiers (pruning with M = 5)   Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.82               0.50                2.1%         50%
Trunc. PageRank t = 2              0.85               0.50                1.6%         50%
Trunc. PageRank t = 3              0.84               0.47                1.6%         53%
Trunc. PageRank t = 4              0.79               0.45                2.2%         55%
Est. Supporters d = 2              0.78               0.60                3.2%         40%
Est. Supporters d = 3              0.83               0.64                2.4%         36%
Est. Supporters d = 4              0.86               0.64                2.0%         36%
Comparison of single-technique classifiers (M = 30)
Classifiers (pruning with M = 30)  Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.80               0.49                2.3%         51%
Trunc. PageRank t = 2              0.82               0.43                1.8%         57%
Trunc. PageRank t = 3              0.81               0.42                1.8%         58%
Trunc. PageRank t = 4              0.77               0.43                2.4%         57%
Est. Supporters d = 2              0.76               0.52                3.1%         48%
Est. Supporters d = 3              0.83               0.57                2.1%         43%
Est. Supporters d = 4              0.80               0.57                2.6%         43%
Combined classifier
Pruning   Rules   Spam class Precision   Spam class Recall   False Pos.   False Neg.
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Truncated PageRank
Reduce the direct contribution of the first levels of links:
\[ \mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases} \]
✓ No extra reading of the graph after PageRank
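One way to pin down the constant $C$ is sketched below. It assumes that the damping weights should sum to one, as they do for PageRank; the slide does not state this. The resulting value agrees with the initial value $(1-\alpha)/(\alpha^{T+1}N)$ that the algorithm slide assigns to R[i].

```latex
\[
\sum_{t=T+1}^{\infty} C\alpha^{t}
  = \frac{C\,\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
\]
```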
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α) / ((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src] / outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
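The pseudocode above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. The adjacency-list input `out_links`, the stopping rule (stop once the rank still in flight falls below `tol`), and the handling of nodes without out-links are assumptions introduced here.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=200):
    """Truncated PageRank sketch; out_links[src] lists the nodes src links to (0..N-1)."""
    n = len(out_links)
    # Initial value from the slide: (1 - alpha) / (alpha^(T+1) * N).
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    # With T = -1 nothing is truncated and the result is ordinary PageRank.
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    for _ in range(max_iter):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):
            if not dests:
                continue  # assumption: rank held by dangling nodes is dropped
            share = r[src] / len(dests)
            for dest in dests:  # follow links in the graph
                aux[dest] += share
        r = [alpha * a for a in aux]  # apply damping factor alpha
        if distance > T:  # only paths longer than T contribute
            score = [s + x for s, x in zip(score, r)]
        distance += 1
        if sum(r) < tol:  # assumption: converged when remaining mass is negligible
            break
    return score
```

On a three-node cycle, `truncated_pagerank([[1], [2], [0]], alpha=0.5, T=-1)` gives each node a score close to 1/3. With `T=0`, a node that has no in-links receives a score of exactly zero.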
Truncated PageRank vs PageRank
[Figure: two log-log plots of Truncated PageRank (vertical axis) against normal PageRank (horizontal axis), one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4. The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010); the bits are propagated along links using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al. 2002], based on probabilistic counting [Flajolet and Martin 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
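The propagation loop above can be sketched in Python, with each node's k-bit vector stored as an integer. This is a hedged illustration rather than the authors' code. The function name `propagate_bits`, the adjacency-list input, the `seed` parameter, and the choice of setting each bit independently with probability `epsilon` as the INIT step are assumptions made for the example. The ESTIMATE step is left out; only the bit vectors are returned.

```python
import random


def propagate_bits(out_links, d, k, epsilon, seed=0):
    """Return the k-bit vector of every node after d rounds of OR propagation."""
    rng = random.Random(seed)
    n = len(out_links)
    # INIT (assumed): every bit of every node is set with probability epsilon.
    v = []
    for _ in range(n):
        bits = 0
        for b in range(k):
            if rng.random() < epsilon:
                bits |= 1 << b
        v.append(bits)
    for _ in range(d):  # iteration step
        aux = [0] * n
        for src, dests in enumerate(out_links):  # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]
        # As written on the slide, V is replaced by Aux, so a node keeps
        # only the bits that reached it over links in this round.
        v = aux
    return v


def count_ones(vector):
    """Number of bits set in a bit vector stored as an int."""
    return bin(vector).count("1")
```

On the chain `[[1], [2], []]`, one round moves the bits of node 0 to node 1 and those of node 1 to node 2; node 0 has no in-links and ends with an empty vector.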
Our estimator
Initialize each bit to one with probability $\epsilon$. By the independence of the $i$-th components $X_i$, we have
\[ P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(\mathrm{node})} \]
Estimator:
\[ \mathrm{neighbors}(\mathrm{node}) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(\mathrm{node})}{k}\right) \]
Problem: neighbors(node) can vary by orders of magnitude as node varies. This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
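The estimator and the saturation problem described above can be illustrated with a short Python sketch. The function names are hypothetical. Returning infinity when all k bits are set and zero when none are set is a convention chosen here to make the degenerate cases explicit; the slides do not specify it.

```python
import math


def expected_ones_fraction(neighbors, epsilon):
    """P[X_i = 1] = 1 - (1 - epsilon)^neighbors, the expected fraction of bits set."""
    return 1.0 - (1.0 - epsilon) ** neighbors


def estimate_neighbors(ones, k, epsilon):
    """Invert the formula above: log base (1 - epsilon) of (1 - ones / k)."""
    if ones >= k:
        return math.inf  # saturated: every bit is set, no finite estimate
    if ones <= 0:
        return 0.0  # no bit is set
    return math.log(1.0 - ones / k) / math.log(1.0 - epsilon)
```

For instance, with ε = 0.5 and k = 8, observing 6 ones gives an estimate of about 2 neighbors. With ε = 0.5 and 1000 neighbors, the expected fraction of ones is indistinguishable from 1, which is the saturation problem the slide points out.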
Adaptive estimator
If we knew neighbors(node) and chose $\epsilon = 1/\mathrm{neighbors}(\mathrm{node})$, we would get
\[ \mathrm{ones}(\mathrm{node}) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63k \]
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
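A minimal sketch of the transition search, assuming the number of ones observed for each value of ε has already been collected. The function name, the input layout, and the decision to report 1/ε at the first round that falls below the threshold are assumptions made for illustration; the slides do not say which value is reported at the transition.

```python
import math


def adaptive_estimate(ones_by_round, k):
    """ones_by_round[j] is the count of ones observed with epsilon = 1 / 2**(j + 1).

    Returns 1/epsilon for the first round whose count drops below (1 - 1/e) * k,
    or None if no round drops below the threshold.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for j, ones in enumerate(ones_by_round):
        if ones < threshold:
            return 2 ** (j + 1)
    return None
```

With k = 100 the threshold is about 63.2, so the counts `[100, 98, 80, 60, 30]` cross it in the fourth round (ε = 1/16) and the sketch reports 16.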
Convergence
[Figure: fraction of nodes with estimates (vertical axis, 0 to 100) against iteration (horizontal axis, 5 to 20), one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
Less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 ... 4, Supporters at 2 ... 4
We measured:
\[ \text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}} \]
\[ \text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}} \]
and the two types of errors in spam classification:
\[ \text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}} \]
\[ \text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}} \]
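The four measures defined above can be computed from labelled hosts as in the following sketch. The function name and the boolean encoding (True for spam) are assumptions for the example. A ratio with an empty denominator is reported as NaN, a convention chosen here rather than taken from the slides.

```python
def spam_metrics(labels, predictions):
    """labels[i] / predictions[i] are True when host i is (classified as) spam."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # spam classified as spam
    fp = sum(1 for y, p in pairs if not y and p)      # normal classified as spam
    fn = sum(1 for y, p in pairs if y and not p)      # spam classified as normal
    tn = sum(1 for y, p in pairs if not y and not p)  # normal classified as normal

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, tp + fn),
    }
```

Because both use the number of spam hosts as the denominator, the false negative rate is always one minus the recall. The result tables follow the same pattern, for example recall 0.49 alongside false negatives of 51%.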
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers (pruning with M = 5)    Spam class         False   False
                                    Prec.    Recall    Pos.    Neg.
TrustRank                           0.82     0.50      2.1%    50%
Trunc. PageRank t = 2               0.85     0.50      1.6%    50%
Trunc. PageRank t = 3               0.84     0.47      1.6%    53%
Trunc. PageRank t = 4               0.79     0.45      2.2%    55%
Est. Supporters d = 2               0.78     0.60      3.2%    40%
Est. Supporters d = 3               0.83     0.64      2.4%    36%
Est. Supporters d = 4               0.86     0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers                  Spam class           False    False
(pruning with M = 5)         Prec.     Recall     Pos.     Neg.
TrustRank                    0.82      0.50       2.1%     50%
Trunc. PageRank t = 2        0.85      0.50       1.6%     50%
Trunc. PageRank t = 3        0.84      0.47       1.6%     53%
Trunc. PageRank t = 4        0.79      0.45       2.2%     55%
Est. Supporters d = 2        0.78      0.60       3.2%     40%
Est. Supporters d = 3        0.83      0.64       2.4%     36%
Est. Supporters d = 4        0.86      0.64       2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers                  Spam class           False    False
(pruning with M = 30)        Prec.     Recall     Pos.     Neg.
TrustRank                    0.80      0.49       2.3%     51%
Trunc. PageRank t = 2        0.82      0.43       1.8%     57%
Trunc. PageRank t = 3        0.81      0.42       1.8%     58%
Trunc. PageRank t = 4        0.77      0.43       2.4%     57%
Est. Supporters d = 2        0.76      0.52       3.1%     48%
Est. Supporters d = 3        0.83      0.57       2.1%     43%
Est. Supporters d = 4        0.80      0.57       2.6%     43%
Combined classifier
                             Spam class            False    False
Pruning      Rules           Precision   Recall    Pos.     Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al. 2006] is a link-based ranking algorithm that computes a scoring vector W of the form
$$W = \sum_{t=0}^{\infty} \frac{\text{damping}(t)}{N} P^t$$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
PageRank itself corresponds to exponential damping:
$$\text{damping}(t) = (1 - \alpha)\,\alpha^t$$
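The sum above can be evaluated term by term for any damping function. The sketch below is an illustration under stated assumptions, not the authors' implementation: the graph is an adjacency list `out_links` in which every node has at least one out-link, the infinite sum is cut at a hypothetical `max_t`, and the names `functional_ranking` and `pagerank_damping` are invented here.

```python
def functional_ranking(out_links, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * P^t, truncating the sum at max_t.

    out_links[i] lists the nodes that node i links to; P is the
    row-normalized link matrix implied by these lists.
    """
    n = len(out_links)
    walk = [1.0 / n] * n  # the row vector (1/N) * 1^T * P^t, starting at t = 0
    w = [0.0] * n
    for t in range(max_t + 1):
        d = damping(t)
        for i in range(n):
            w[i] += d * walk[i]
        # Advance one step along the links: walk <- walk * P.
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                nxt[dest] += walk[src] / len(dests)
        walk = nxt
    return w


def pagerank_damping(alpha):
    """Exponential damping (1 - alpha) * alpha^t, which yields PageRank."""
    return lambda t: (1 - alpha) * alpha ** t
```

For instance, `functional_ranking([[1], [0], [0]], pagerank_damping(0.85))` scores a three-node graph in which node 2 has no in-links, so it only receives the t = 0 term.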
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$$
No extra reading of the graph after PageRank.
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α) / (α^(T+1) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src] / outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
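The listing above translates almost line by line into Python. This is a sketch under assumptions that are not in the slides: the graph is an adjacency list `out_links`, and "converged" is taken to mean that the mass still being propagated has dropped below a hypothetical tolerance `tol`.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    """Truncated PageRank following the listing; T = -1 gives normal PageRank."""
    n = len(out_links)
    # Initialization: the constant C = (1 - alpha) / alpha^(T+1) is folded into R.
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    for distance in range(1, max_iter + 1):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):  # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]             # apply damping factor alpha
        if distance > T:                         # add to ranking value
            for i in range(n):
                score[i] += r[i]
        if sum(r) < tol:                         # assumed convergence test
            break
    return score
```

With `out_links = [[1], [0], [0]]`, node 2 has no in-links: it keeps a PageRank of (1 − α)/N when T = −1, and its score is 0 for any T ≥ 0 because the t = 0 term is truncated.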
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (vertical axis) against normal PageRank (horizontal axis), one panel for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011). Bits are propagated along the links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al. 2002], based on probabilistic counting [Flajolet and Martin 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
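The propagation step of the listing can be sketched with Python integers used as k-bit vectors. The names `init_bits` and `or_propagate`, the adjacency-list input, and the seeded random generator are assumptions for illustration; INIT is taken to set each bit independently with probability ε, and ESTIMATE is left out. The sketch follows the listing literally (Aux starts from zero in every round), so after d rounds a node holds the bits of the nodes that reach it through a walk of exactly d links.

```python
import random


def init_bits(n, k, epsilon, seed=0):
    """INIT: give each node a k-bit vector, each bit set with probability epsilon."""
    rng = random.Random(seed)
    return [sum(1 << b for b in range(k) if rng.random() < epsilon)
            for _ in range(n)]


def or_propagate(out_links, v, d):
    """Run d rounds of bit propagation along the links using bitwise OR."""
    n = len(out_links)
    for _ in range(d):
        aux = [0] * n                      # Aux <- 0_k for every node
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]        # Aux[dest] <- Aux[dest] OR V[src, .]
        v = aux
    return v
```

For example, `or_propagate([[2], [2], []], [0b001, 0b010, 0b100], 1)` leaves node 2 with the vector `0b011`, the OR of the vectors of its two in-neighbors; counting its set bits with `bin(x).count("1")` gives the `ones(node)` value that an estimator would consume.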
Our estimator
Initialize each bit to one with probability ε.
By the independence of the i-th components X_i, we have
$$P[X_i = 1] = 1 - (1 - \epsilon)^{\text{neighbors}(node)}$$
Estimator:
$$\text{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\text{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
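The estimator and its saturation problem can be checked numerically. This is a small sketch with invented names (`estimate_neighbors`, `expected_ones`); returning infinity when all k bits are set is a choice made here to signal saturation, not something stated in the slides.

```python
import math


def expected_ones(neighbors, k, epsilon):
    """Expected number of set bits: k * (1 - (1 - epsilon)^neighbors)."""
    return k * (1 - (1 - epsilon) ** neighbors)


def estimate_neighbors(ones, k, epsilon):
    """Invert the expectation: log base (1 - epsilon) of (1 - ones / k)."""
    if ones >= k:
        return math.inf  # every bit is set, so the count cannot be recovered
    return math.log(1 - ones / k, 1 - epsilon)
```

With k = 1000 and ε = 0.5, a node with 3 neighbors is expected to show 875 ones, and `estimate_neighbors(875, 1000, 0.5)` recovers a value of about 3. With the same ε, a node with 10000 neighbors is expected to have all of its bits set, which is the saturation case described above.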
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$\text{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
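The search for the transition can be sketched as a loop over ε = 1/2, 1/4, 1/8, ... The code below is a hedged illustration: `adaptive_bracket` and `expected_ones_for` are hypothetical names, the bit counts come from the expected value rather than from real propagation, and reporting the pair (1/ε before the transition, 1/ε after it) is an assumption about how to read the result, not the authors' exact estimator.

```python
import math


def adaptive_bracket(ones_for, k, max_rounds=40):
    """Halve epsilon until ones_for(epsilon) drops below (1 - 1/e) * k.

    ones_for(epsilon) must return the number of set bits observed for the
    node when bits are initialized with probability epsilon. Returns the
    pair (1 / previous epsilon, 1 / epsilon) around the transition.
    """
    threshold = (1 - 1 / math.e) * k
    prev_eps = 1.0
    for j in range(1, max_rounds + 1):
        eps = 2.0 ** -j
        if ones_for(eps) < threshold:
            return 1 / prev_eps, 1 / eps
        prev_eps = eps
    return 1 / prev_eps, math.inf  # no transition found within max_rounds


def expected_ones_for(neighbors, k):
    """Stand-in for real measurements: the expected number of set bits."""
    return lambda eps: k * (1 - (1 - eps) ** neighbors)
```

For a node with 100 neighbors and k = 64, `adaptive_bracket(expected_ones_for(100, 64), 64)` returns `(64.0, 128.0)`: the count is above 0.63k for ε = 1/64 and below it for ε = 1/128, so the number of neighbors is placed between 64 and 128.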
Convergence
[Figure: fraction of nodes with estimates (vertical axis, 0 to 100) against iteration (horizontal axis, 5 to 20), one curve per distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al. 2006] is a link-based ranking algorithm that computes a scoring vector W of the form
$$W = \sum_{t=0}^{\infty} \frac{\text{damping}(t)}{N} P^{t}$$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice. PageRank itself corresponds to the exponential damping
$$\text{damping}(t) = (1 - \alpha)\,\alpha^{t}$$
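To make the sum concrete, the following is a minimal sketch that approximates W by truncating the series after `max_t` terms. The function name `functional_ranking`, the adjacency-list input (nodes numbered 0 to N-1, each mapped to its out-links) and the toy graph are assumptions made for illustration, not part of the slides. Nodes without out-links would simply lose their mass in this sketch.

```python
from typing import Callable, Dict, List


def functional_ranking(out_links: Dict[int, List[int]],
                       damping: Callable[[int], float],
                       max_t: int = 200) -> List[float]:
    """Approximate W = sum_t damping(t)/N * P^t, truncating the sum at max_t."""
    n = len(out_links)
    # walk[i] holds entry i of (1/N, ..., 1/N) P^t; start at t = 0.
    walk = [1.0 / n] * n
    score = [damping(0) * w for w in walk]
    for t in range(1, max_t + 1):
        nxt = [0.0] * n
        for src, dests in out_links.items():
            for dest in dests:
                # Row-normalized P: each out-link carries 1/outdegree(src).
                nxt[dest] += walk[src] / len(dests)
        walk = nxt
        for i in range(n):
            score[i] += damping(t) * walk[i]
    return score


alpha = 0.85
toy_graph = {0: [1], 1: [2], 2: [0, 1]}  # hypothetical example graph
pagerank_like = functional_ranking(toy_graph, lambda t: (1 - alpha) * alpha ** t)
```

Passing a different `damping` callable is all that is needed to try another functional ranking on the same graph.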
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
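The slide does not give the value of C. One natural choice, assumed here, is to require that the damping coefficients sum to one, as they do for PageRank:

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

For T = -1 this reduces to C = 1 - α, which is the PageRank damping. It also appears consistent with the initial value of R[i] in the general algorithm for Truncated PageRank.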
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
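The pseudocode above can be sketched in Python as follows. This is an illustrative translation, not the authors' implementation: the function name, the adjacency-list input (nodes numbered 0 to N-1), the `tol` stopping rule and the `max_iter` cap are assumptions, since the slide only says "while not converged". As in the pseudocode, nodes without out-links lose their mass.

```python
from typing import Dict, List


def truncated_pagerank(out_links: Dict[int, List[int]], alpha: float = 0.85,
                       T: int = -1, tol: float = 1e-12,
                       max_iter: int = 1000) -> List[float]:
    """Truncated PageRank; T = -1 yields ordinary PageRank."""
    n = len(out_links)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N).
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    for distance in range(1, max_iter + 1):
        aux = [0.0] * n
        for src, dests in out_links.items():  # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]  # apply damping factor alpha
        if distance > T:
            for i in range(n):
                score[i] += r[i]  # add to ranking value
            # Assumed stopping rule: r shrinks by a factor alpha per step,
            # so stop once the next contribution is negligible.
            if max(r) < tol:
                break
    return score
```

On a graph where every node has out-links, the scores sum to one for any T under this initialization, which makes a convenient sanity check.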
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (y-axis, 10^-9 to 10^-3) against normal PageRank (x-axis, 10^-8 to 10^-4), one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011); bits are propagated along links using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al. 2002] based on probabilistic counting [Flajolet and Martin 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
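A minimal sketch of the initialization and propagation steps, using Python integers as k-bit vectors. The function names, the adjacency-list input and the use of `random.Random` are assumptions for illustration. The sketch follows the pseudocode literally: Aux restarts from zero in every round, so a node keeps only the bits that reach it in the current round.

```python
import random
from typing import Dict, List


def init_bits(n: int, k: int, eps: float, rng: random.Random) -> List[int]:
    """INIT: set each of the k bits of every node to one with probability eps."""
    return [sum(1 << b for b in range(k) if rng.random() < eps) for _ in range(n)]


def propagate(out_links: Dict[int, List[int]], bits: List[int], d: int) -> List[int]:
    """Run d iteration steps, OR-ing each source vector into its destinations."""
    v = list(bits)
    for _ in range(d):
        aux = [0] * len(v)  # Aux <- 0_k for every node
        for src, dests in out_links.items():  # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]
        v = aux
    return v


def count_ones(mask: int) -> int:
    """Number of bits set in a node's vector (input to the ESTIMATE step)."""
    return bin(mask).count("1")
```

With hand-picked vectors instead of random ones, the propagation is easy to trace: in a graph with links 1→0, 2→0 and 3→1, one step leaves node 0 holding the OR of the vectors of nodes 1 and 2.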
Our estimator
Initialize each bit to one with probability ε.
By the independence of the i-th components X_i, we have
$$P[X_i = 1] = 1 - (1 - \varepsilon)^{\text{neighbors}(node)}$$
Estimator:
$$\text{neighbors}(node) = \log_{(1-\varepsilon)}\left(1 - \frac{\text{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
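The estimator inverts the expected fraction of set bits. The sketch below shows both directions and the saturation problem; the function names are assumptions, and returning infinity when all k bits are set is an illustrative convention, not something the slides specify.

```python
import math


def expected_ones(neighbors: int, k: int, eps: float) -> float:
    """Expected number of set bits: k * (1 - (1 - eps)^neighbors)."""
    return k * (1 - (1 - eps) ** neighbors)


def estimate_neighbors(ones: float, k: int, eps: float) -> float:
    """Invert the expectation: log base (1 - eps) of (1 - ones/k)."""
    if not 0 <= ones <= k:
        raise ValueError("ones must lie between 0 and k")
    if ones == k:
        # Saturated vector: the logarithm is undefined, so no finite estimate.
        return math.inf
    return math.log(1 - ones / k) / math.log(1 - eps)
```

With ε = 0.5 and ten thousand neighbors, the expected count is indistinguishable from k and the estimate is unusable, while a very small ε paired with a handful of neighbors leaves almost every vector at zero.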
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$\text{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63k$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transitions from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
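One way to read the adaptive procedure is as a search for the first ε at which the bit count drops below the (1 − 1/e)k threshold. The sketch below is an assumption-laden illustration: `adaptive_bracket` and its callback `ones_for` (which returns the observed number of set bits for a given ε) are hypothetical names, and reporting the pair (1/ε_previous, 1/ε) as a bracket for the neighbor count is one possible interpretation of "look for the transitions", not the authors' stated rule.

```python
import math
from typing import Callable, Optional, Tuple


def adaptive_bracket(ones_for: Callable[[float], float], k: int,
                     max_halvings: int = 32) -> Optional[Tuple[int, int]]:
    """Try eps = 1/2, 1/4, ... and return (2^(j-1), 2^j) at the first eps = 2^-j
    whose bit count falls below (1 - 1/e) * k; None if no transition is seen."""
    threshold = (1 - 1 / math.e) * k
    for j in range(1, max_halvings + 1):
        eps = 2.0 ** -j
        if ones_for(eps) < threshold:
            return 2 ** (j - 1), 2 ** j
    return None
```

Feeding the callback with expected bit counts for a node with 100 neighbors places the transition between ε = 1/64 and ε = 1/128. In practice the counts come from actual propagation runs and are noisy, so the bracket is only approximate.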
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) versus iteration (x-axis, 5 to 20), one curve for each distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
fewer than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
$$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$
$$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$
and the two types of errors in spam classification:
$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$
$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
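The four measures follow directly from the confusion counts of a classifier. A minimal sketch, where the function name and the argument names are assumptions (spam is treated as the positive class):

```python
from typing import Dict


def spam_metrics(spam_as_spam: int, normal_as_spam: int,
                 spam_as_normal: int, normal_as_normal: int) -> Dict[str, float]:
    """Compute the four measures from host counts."""
    spam_hosts = spam_as_spam + spam_as_normal
    normal_hosts = normal_as_spam + normal_as_normal
    return {
        "precision": spam_as_spam / (spam_as_spam + normal_as_spam),
        "recall": spam_as_spam / spam_hosts,
        "false_positive_rate": normal_as_spam / normal_hosts,
        "false_negative_rate": spam_as_normal / spam_hosts,
    }
```

Recall and the false negative rate share the same denominator, so they always sum to one.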
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
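As an illustration of how such a feature vector might be assembled for one host, the sketch below builds the Truncated PageRank features. The function name, the dictionary input keyed by truncation distance, and the ordering of the features are all assumptions; the slides only list which quantities are used.

```python
from typing import Dict, List


def truncated_pagerank_features(pagerank: float,
                                truncated: Dict[int, float]) -> List[float]:
    """PageRank, then for each distance t the Truncated PageRank and its
    ratio to PageRank. Assumes pagerank > 0, which holds when damping is used."""
    features = [pagerank]
    for t in sorted(truncated):
        features.append(truncated[t])
        features.append(truncated[t] / pagerank)
    return features
```

The supporter-based classifier could be fed the same way, with supporter estimates at d = 2, 3, 4 in place of the Truncated PageRank values.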
Comparison of single-technique classifier (M=5)
Classifiers               Spam class         False    False
(pruning with M = 5)      Prec.    Recall    Pos.     Neg.
TrustRank                 0.82     0.50      2.1%     50%
Trunc. PageRank t = 2     0.85     0.50      1.6%     50%
Trunc. PageRank t = 3     0.84     0.47      1.6%     53%
Trunc. PageRank t = 4     0.79     0.45      2.2%     55%
Est. Supporters d = 2     0.78     0.60      3.2%     40%
Est. Supporters d = 3     0.83     0.64      2.4%     36%
Est. Supporters d = 4     0.86     0.64      2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers               Spam class         False    False
(pruning with M = 30)     Prec.    Recall    Pos.     Neg.
TrustRank                 0.80     0.49      2.3%     51%
Trunc. PageRank t = 2     0.82     0.43      1.8%     57%
Trunc. PageRank t = 3     0.81     0.42      1.8%     58%
Trunc. PageRank t = 4     0.77     0.43      2.4%     57%
Est. Supporters d = 2     0.76     0.52      3.1%     48%
Est. Supporters d = 3     0.83     0.57      2.1%     43%
Est. Supporters d = 4     0.80     0.57      2.6%     43%
Combined classifier
                     Spam class            False    False
Pruning    Rules     Precision    Recall   Pos.     Neg.
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Convergence
[Figure: fraction of nodes with estimates (0 to 100) versus iteration (5 to 20), one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
Fewer than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank) PageRank, Truncated PageRank at 2 ... 4, and Supporters at 2 ... 4.
We measured:

$$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$

$$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$

and the two types of errors in spam classification:

$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$

$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
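The four measures follow directly from the confusion counts of a classifier. The Python sketch below shows the arithmetic; the function name `spam_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures from confusion counts.

    tp: spam hosts classified as spam      fp: normal hosts classified as spam
    tn: normal hosts classified as normal  fn: spam hosts classified as normal
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }


# Hypothetical counts: 100 spam hosts and 500 normal hosts.
metrics = spam_metrics(tp=50, fp=10, tn=490, fn=50)
print(metrics)
```

Note that the false negative rate is always one minus the recall, since both are fractions of the same set of spam hosts.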
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be based on in-degree only), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                  Spam class         False    False
(pruning with M = 5)         Prec.    Recall    Pos.     Neg.
TrustRank                    0.82     0.50      2.1%     50%
Trunc. PageRank t = 2        0.85     0.50      1.6%     50%
Trunc. PageRank t = 3        0.84     0.47      1.6%     53%
Trunc. PageRank t = 4        0.79     0.45      2.2%     55%
Est. Supporters d = 2        0.78     0.60      3.2%     40%
Est. Supporters d = 3        0.83     0.64      2.4%     36%
Est. Supporters d = 4        0.86     0.64      2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers                  Spam class         False    False
(pruning with M = 30)        Prec.    Recall    Pos.     Neg.
TrustRank                    0.80     0.49      2.3%     51%
Trunc. PageRank t = 2        0.82     0.43      1.8%     57%
Trunc. PageRank t = 3        0.81     0.42      1.8%     58%
Trunc. PageRank t = 4        0.77     0.43      2.4%     57%
Est. Supporters d = 2        0.76     0.52      3.1%     48%
Est. Supporters d = 3        0.83     0.57      2.1%     43%
Est. Supporters d = 4        0.80     0.57      2.6%     43%
Combined classifier
                     Spam class            False    False
Pruning    Rules     Precision   Recall    Pos.     Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E). A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm that computes a scoring vector W of the form

$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$$

There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice. PageRank itself corresponds to the exponentially decaying choice

$$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$$
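To make the definition concrete, here is a minimal Python sketch that evaluates the sum up to a finite number of terms. It rests on several assumptions that are not in the slides: W is read as the row vector (1/N)·1·P^t weighted by damping(t), every node has at least one out-link, and the names `functional_ranking`, `out_links`, and `max_t` are hypothetical.

```python
def functional_ranking(out_links, damping, max_t):
    """Approximate W = sum_t damping(t)/N * P^t, truncating the sum at max_t.

    out_links[u] lists the nodes that u links to; every node is assumed to
    have at least one out-link so that the row-normalized P is well defined.
    """
    n = len(out_links)
    walk = [1.0 / n] * n                      # (1/N) * 1 * P^0
    score = [damping(0) * w for w in walk]
    for t in range(1, max_t + 1):
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            share = walk[src] / len(dests)    # row-normalized transition
            for dest in dests:
                nxt[dest] += share
        walk = nxt                            # now (1/N) * 1 * P^t
        score = [s + damping(t) * w for s, w in zip(score, walk)]
    return score


ALPHA = 0.85


def pagerank_damping(t):
    """Exponential damping (1 - alpha) * alpha^t."""
    return (1 - ALPHA) * ALPHA ** t


# Small example graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
scores = functional_ranking([[1], [2], [0, 1]], pagerank_damping, max_t=200)
print(scores)
```

Any other damping function can be passed in place of `pagerank_damping`, which is the point of the functional-ranking view: the propagation loop stays the same and only the weights change.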
Truncated PageRank
Reduce the direct contribution of the first levels of links:

$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$$

✓ No extra reading of the graph after PageRank
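The constant C is not defined on this slide. Assuming it is chosen so that the damping weights sum to one, as they do for PageRank, a short calculation gives its value:

$$\sum_{t=T+1}^{\infty} C\alpha^t = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1 \quad\Longrightarrow\quad C = \frac{1-\alpha}{\alpha^{T+1}}$$

For T = −1 nothing is truncated and C reduces to 1 − α, the PageRank damping.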
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α) / (α^(T+1) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src] / outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
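The pseudocode translates almost line by line into Python. The sketch below is an illustration under assumptions rather than the authors' implementation: it replaces the unspecified convergence test with a fixed number of iterations, assumes every node has at least one out-link, and uses the hypothetical names `truncated_pagerank` and `out_links`.

```python
def truncated_pagerank(out_links, alpha, T, iterations=200):
    """Truncated PageRank following the slide's pseudocode.

    out_links[u] lists the nodes that u links to (each node is assumed to
    have at least one out-link). T = -1 gives ordinary PageRank.
    A fixed iteration count stands in for the convergence test (assumption).
    """
    n = len(out_links)
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n   # initialization
    score = [0.0] * n if T >= 0 else list(r)
    for distance in range(1, iterations + 1):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):      # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]                 # apply damping factor
        if distance > T:                             # add to ranking value
            score = [s + x for s, x in zip(score, r)]
    return score


# Small example graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
graph = [[1], [2], [0, 1]]
pagerank = truncated_pagerank(graph, alpha=0.85, T=-1)
truncated = truncated_pagerank(graph, alpha=0.85, T=2)
print(pagerank, truncated)
```

Both calls share the same propagation loop; the only difference is the distance from which contributions start to be accumulated, which is why no extra pass over the graph is needed.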
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (y-axis, 10^-9 to 10^-3) against normal PageRank (x-axis, 10^-8 to 10^-4), one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4. The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page carries a bit vector (for example 100010, 110000, 000110, 000011); bits are propagated along links to the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
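The bit-propagation loop can be sketched in Python by storing each k-bit vector as an integer. The sketch follows the pseudocode literally, including the reset of Aux to zero at every step. The bodies of `init_bits` and `estimate` are assumptions: the pseudocode leaves INIT and ESTIMATE abstract, so the versions here (set each bit with probability eps, then invert the expected fraction of ones) are only one possible reading, and all function names are hypothetical.

```python
import math
import random


def propagate(out_links, bits, d):
    """Run d rounds of OR-propagation; bits[node] is a k-bit vector stored as an int."""
    v = list(bits)
    for _ in range(d):
        aux = [0] * len(v)                   # Aux <- 0_k for every node
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]          # Aux[dest] <- Aux[dest] OR V[src, .]
        v = aux
    return v


def init_bits(n, k, eps, rng):
    """INIT (assumed): set each of the k bits of each node with probability eps."""
    return [sum(1 << b for b in range(k) if rng.random() < eps) for _ in range(n)]


def estimate(vector, k, eps):
    """ESTIMATE (assumed): invert the expected fraction of ones; inf if saturated."""
    ones = bin(vector).count("1")
    if ones == k:
        return math.inf
    return math.log(1 - ones / k) / math.log(1 - eps)


# Nodes 0 and 1 both link to node 2; node 2 has no out-links.
out_links = [[2], [2], []]
start = init_bits(3, k=64, eps=0.1, rng=random.Random(0))
v = propagate(out_links, start, d=1)
supporters = [estimate(x, 64, 0.1) for x in v]
print(supporters)
```

Packing the bits into integers means one OR per link handles all k bits at once, which is what keeps the cost of each iteration close to a single pass over the links.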
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get

$$\mathrm{ones}(node) \approx \left(1 - \frac{1}{e}\right) k \approx 0.63\,k$$

Adaptive estimation:
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
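One way to read the adaptive scheme is sketched below in Python. The function `adaptive_estimate` and its input format are hypothetical: it assumes the number of set bits has already been measured for each round ε = 1/2, 1/4, 1/8, ..., and it reports the interval between the two values of 1/ε around the transition. The slides do not say how the final estimate is derived from the transition, so this is an interpretation.

```python
import math

def adaptive_estimate(ones_by_round, k):
    """ones_by_round[j] is the number of set bits (out of k) for eps = 2**-(j + 1).

    Returns (lower, upper): the values of 1/eps just before and at the first
    round where fewer than (1 - 1/e) * k bits are set.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for j, ones in enumerate(ones_by_round):
        if ones < threshold:
            lower = 2 ** j if j > 0 else 0
            return lower, 2 ** (j + 1)
    # No transition observed: more rounds (smaller eps) would be needed.
    return 2 ** len(ones_by_round), math.inf

# Expected bit counts for a node with 100 neighbors and k = 1000 bits.
k, n = 1000, 100
ones = [k * (1.0 - (1.0 - 2.0 ** -(j + 1)) ** n) for j in range(10)]
interval = adaptive_estimate(ones, k)
```

With these expected counts the transition falls between ε = 1/64 and ε = 1/128, so the interval brackets the true value of 100.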
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis), one curve per distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Fewer than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, Supporters at 2, ..., 4.
We measured:

$$\mathrm{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$

$$\mathrm{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$

and the two types of errors in spam classification:

$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$

$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
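The four measures can be computed from labeled hosts as in the following Python sketch. The function name `spam_metrics` and the boolean encoding (True means spam) are assumptions made for illustration, and the sketch assumes each denominator is non-zero.

```python
def spam_metrics(actual, predicted):
    """actual[i] / predicted[i] are True when host i is (classified as) spam."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # spam classified as spam
    fp = sum(1 for a, p in pairs if not a and p)      # normal classified as spam
    fn = sum(1 for a, p in pairs if a and not p)      # spam classified as normal
    tn = sum(1 for a, p in pairs if not a and not p)  # normal classified as normal
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }

# Hypothetical toy data: 4 spam hosts and 6 normal hosts.
actual = [True] * 4 + [False] * 6
predicted = [True, True, False, False, True, False, False, False, False, False]
metrics = spam_metrics(actual, predicted)
```

Recall and the false negative rate share the same denominator, so they always sum to one.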
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                 Spam class         False    False
(pruning with M = 5)        Prec.    Recall    Pos.     Neg.
TrustRank                   0.82     0.50      2.1%     50%
Trunc. PageRank t = 2       0.85     0.50      1.6%     50%
Trunc. PageRank t = 3       0.84     0.47      1.6%     53%
Trunc. PageRank t = 4       0.79     0.45      2.2%     55%
Est. Supporters d = 2       0.78     0.60      3.2%     40%
Est. Supporters d = 3       0.83     0.64      2.4%     36%
Est. Supporters d = 4       0.86     0.64      2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers                 Spam class         False    False
(pruning with M = 30)       Prec.    Recall    Pos.     Neg.
TrustRank                   0.80     0.49      2.3%     51%
Trunc. PageRank t = 2       0.82     0.43      1.8%     57%
Trunc. PageRank t = 3       0.81     0.42      1.8%     58%
Trunc. PageRank t = 4       0.77     0.43      2.4%     57%
Est. Supporters d = 2       0.76     0.52      3.1%     48%
Est. Supporters d = 3       0.83     0.57      2.1%     43%
Est. Supporters d = 4       0.80     0.57      2.6%     43%
Combined classifier
                     Spam class             False    False
Pruning    Rules     Precision    Recall    Pos.     Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al. 2006] is a link-based ranking algorithm to compute a scoring vector W of the form

$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$$

There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice. PageRank itself corresponds to

$$\mathrm{damping}(t) = (1 - \alpha)\alpha^t$$
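The series can be evaluated directly, as in the Python sketch below. It reads the formula as W = (1/N) · 1ᵀ · Σ damping(t) Pᵗ, that is, a uniform row vector pushed through P at every step; this reading, the function name `functional_ranking`, and the truncation of the infinite sum at `max_t` terms are assumptions of the sketch. P is a dense list of rows, and every row is assumed to sum to one (no dangling nodes).

```python
def functional_ranking(P, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * 1^T P^t with the first max_t + 1 terms."""
    n = len(P)
    r = [1.0 / n] * n   # (1/N) * 1^T P^0
    W = [0.0] * n
    for t in range(max_t + 1):
        d = damping(t)
        W = [w + d * x for w, x in zip(W, r)]
        # Advance one link level: r <- r P
        r = [sum(r[i] * P[i][j] for i in range(n)) for j in range(n)]
    return W

# Hypothetical 3-node cycle 0 -> 1 -> 2 -> 0, with PageRank damping (alpha = 0.85).
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
alpha = 0.85
pagerank_like = functional_ranking(cycle, lambda t: (1 - alpha) * alpha ** t)
```

On this symmetric graph every node receives the same score, close to 1/3.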
Truncated PageRank
Reduce the direct contribution of the first levels of links:

$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$$

✓ No extra reading of the graph after PageRank.
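The slide does not give the value of the constant C. Assuming C is chosen so that the damping coefficients sum to one (as they do for PageRank), it can be worked out as follows; the result agrees with the initialization R[i] ← (1 − α)/(α^{T+1} N) used in the algorithm.

```latex
\sum_{t=0}^{\infty} \mathrm{damping}(t)
  = C \sum_{t=T+1}^{\infty} \alpha^{t}
  = C \, \frac{\alpha^{T+1}}{1-\alpha}
  = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

For T = −1 this gives C = 1 − α, which recovers the PageRank damping function.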
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^{T+1}) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
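The pseudocode translates almost line by line into Python. The sketch below is illustrative only: the adjacency-list input `out_links`, the convergence test (stop when every entry of R falls below `tol`), and the handling of nodes without out-links (their rank is simply dropped) are assumptions not stated on the slide.

```python
def truncated_pagerank(out_links, alpha=0.85, T=-1, tol=1e-12, max_iter=1000):
    """out_links[src] lists the destinations of the links leaving node src."""
    n = len(out_links)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N)
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    # T = -1 computes normal PageRank, so the level-0 term is kept.
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    while distance <= max_iter:
        aux = [0.0] * n
        for src, dests in enumerate(out_links):  # follow links in the graph
            if dests:
                share = r[src] / len(dests)
                for dest in dests:
                    aux[dest] += share
        r = [alpha * a for a in aux]             # apply damping factor alpha
        if distance > T:                         # add to ranking value
            score = [s + x for s, x in zip(score, r)]
        if max(r) < tol:                         # assumed convergence criterion
            break
        distance += 1
    return score

# Hypothetical graph: 0 <-> 1, and 2 -> 0 (node 2 has no in-links).
graph = [[1], [0], [0]]
pagerank = truncated_pagerank(graph, T=-1)
truncated = truncated_pagerank(graph, T=0)
```

Node 2 has no supporters, so its PageRank is only the base value (1 − α)/N, and its Truncated PageRank with T = 0 is exactly zero.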
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (y-axis) against normal PageRank (x-axis), one for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: bit vectors (for example 100010, 110000, 000110, 000011) are propagated along the links toward a target page using the "OR" operation; the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al. 2002], based on probabilistic counting [Flajolet and Martin 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
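A compact Python sketch of this bit-propagation loop follows. It stores each node's k bits in one integer, uses the "set each bit with probability ε" rule for INIT, and uses the logarithmic estimator for ESTIMATE; the function name `count_supporters`, the default parameters, the seeded random generator, and returning infinity for a saturated vector are all assumptions of the sketch.

```python
import math
import random

def count_supporters(out_links, d, k=64, eps=0.05, seed=0):
    """out_links[src] lists the destinations of the links leaving node src."""
    rng = random.Random(seed)
    n = len(out_links)
    # INIT(node, bit): set each of the k bits independently with probability eps.
    V = [sum(1 << bit for bit in range(k) if rng.random() < eps) for _ in range(n)]
    for _ in range(d):                      # iteration step
        aux = [0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:              # follow links in the graph
                aux[dest] |= V[src]
        # As in the pseudocode, V is replaced by Aux, so a node's vector now
        # holds the bits of the nodes that reach it by walks of this length.
        V = aux
    supporters = []
    for bits in V:                          # ESTIMATE(V[node, .])
        ones = bin(bits).count("1")
        if ones == 0:
            supporters.append(0.0)
        elif ones == k:
            supporters.append(math.inf)     # saturated vector
        else:
            supporters.append(math.log(1.0 - ones / k) / math.log(1.0 - eps))
    return supporters

# Hypothetical graph: nodes 0 and 1 both link to node 2.
estimates = count_supporters([[2], [2], []], d=1)
```

Nodes 0 and 1 have no in-links, so their estimates are exactly zero; the estimate for node 2 depends on the random bits drawn.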
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Our estimator
Initialize each bit to one independently with probability ε.
By the independence of the i-th components X_i, we have
$P[X_i = 1] = 1 - (1 - \epsilon)^{\text{neighbors}(node)}$
Estimator:
$\text{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\text{ones}(node)}{k}\right)$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
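The estimator can be written out as a short Python function. This is a sketch under stated assumptions: the function name and the handling of the two degenerate cases (no bits set, all bits set) are illustrative choices, not taken from the slides.

```python
import math

def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)^n using the observed fraction ones/k."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    if ones == 0:
        return 0.0          # no bit set: the estimate collapses to zero
    if ones >= k:
        return math.inf     # saturated vector: the logarithm is undefined
    # log base (1 - eps) of x equals ln(x) / ln(1 - eps)
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)
```

The two early returns correspond to the problem stated on the slide: when ones(node) is 0 or k, the bit vector carries almost no information about the true count.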
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get (since $(1 - 1/n)^n \approx 1/e$ for large n)
$\text{ones}(node) \approx \left(1 - \frac{1}{e}\right) k \approx 0.63\,k$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, …, and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
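The search for the transition can be sketched as below. The function name and input format are hypothetical, and reporting 1/ε at the first round that falls below the threshold is an assumption of this sketch; the slides only say to look for the transition, not how to turn it into a number.

```python
import math

def adaptive_estimate(ones_per_round, k):
    """ones_per_round[i] is the number of one-bits observed with eps = 2**-(i + 1).

    Returns 1/eps for the first round whose count drops below (1 - 1/e) * k,
    or None if no round in the list drops below the threshold.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for i, ones in enumerate(ones_per_round):
        if ones < threshold:
            return 2 ** (i + 1)
    return None
```

Because ε is halved at each round, an estimate of this kind can only be accurate to within a constant factor, which is enough to tell nodes apart when the counts span orders of magnitude.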
Convergence
[Figure: fraction of nodes with estimates versus iteration, one curve for each distance d = 1, …, 8]
15 iterations suffice for estimating the neighbors at distance 4 or less; fewer than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 … 4, Supporters at 2 … 4.
We measured:
$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$
$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$
and the two types of errors in spam classification:
$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$
$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
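The four measures follow directly from the confusion counts of a classifier. A minimal sketch, with hypothetical argument names (`tp` = spam classified as spam, `fp` = normal classified as spam, `fn` = spam classified as normal, `tn` = normal classified as normal):

```python
def spam_metrics(tp, fp, fn, tn):
    """Compute the four measures from host counts; spam is the positive class."""
    return {
        "precision": tp / (tp + fp),            # spam among hosts flagged as spam
        "recall": tp / (tp + fn),               # flagged spam among all spam hosts
        "false_positive_rate": fp / (fp + tn),  # normal hosts wrongly flagged
        "false_negative_rate": fn / (tp + fn),  # spam hosts that were missed
    }
```

Note that recall and the false negative rate share the same denominator, so they always sum to one.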
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers              Spam class        False   False
(pruning with M = 5)     Prec.   Recall    Pos.    Neg.
TrustRank                0.82    0.50      2.1%    50%
Trunc. PageRank t = 2    0.85    0.50      1.6%    50%
Trunc. PageRank t = 3    0.84    0.47      1.6%    53%
Trunc. PageRank t = 4    0.79    0.45      2.2%    55%
Est. Supporters d = 2    0.78    0.60      3.2%    40%
Est. Supporters d = 3    0.83    0.64      2.4%    36%
Est. Supporters d = 4    0.86    0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers              Spam class        False   False
(pruning with M = 30)    Prec.   Recall    Pos.    Neg.
TrustRank                0.80    0.49      2.3%    51%
Trunc. PageRank t = 2    0.82    0.43      1.8%    57%
Trunc. PageRank t = 3    0.81    0.42      1.8%    58%
Trunc. PageRank t = 4    0.77    0.43      2.4%    57%
Est. Supporters d = 2    0.76    0.52      3.1%    48%
Est. Supporters d = 3    0.83    0.57      2.1%    43%
Est. Supporters d = 4    0.80    0.57      2.6%    43%
Combined classifier
Pruning   Rules   Spam class            False   False
                  Precision   Recall    Pos.    Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm that computes a scoring vector W of the form
$W = \sum_{t=0}^{\infty} \frac{\text{damping}(t)}{N} P^t$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
PageRank itself corresponds to the exponentially decaying choice
$\text{damping}(t) = (1 - \alpha)\,\alpha^t$
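The sum defining W can be approximated by truncating it after a fixed number of terms. The following sketch assumes an adjacency-list input in which every node has at least one out-link (dangling nodes are not handled); the function name and the `max_t` cutoff are illustrative, not part of the original definition.

```python
def functional_ranking(out_links, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * P^t by summing terms t = 0..max_t.

    out_links[i] lists the nodes that node i links to; each list must be
    non-empty so that P is row-stochastic (an assumption of this sketch).
    """
    n = len(out_links)
    r = [1.0 / n] * n              # the vector (1/N) * P^0
    w = [0.0] * n
    for t in range(max_t + 1):
        d = damping(t)
        for i in range(n):
            w[i] += d * r[i]       # add the t-th term of the sum
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            share = r[src] / len(dests)
            for dest in dests:
                nxt[dest] += share  # one multiplication by P
        r = nxt
    return w
```

Passing `lambda t: 0.15 * 0.85 ** t` as `damping` gives the PageRank-style exponential decay with α = 0.85; any other non-negative function of t can be substituted.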
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^t & t > T \end{cases}$
✓ No extra reading of the graph after PageRank
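The slide does not state the value of C. Assuming it is the normalization constant that makes the damping coefficients sum to one, it can be worked out from the geometric series:

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

Under this assumption, T = −1 gives C = 1 − α, which recovers the PageRank damping (1 − α)α^t, and the value matches the initialization R[i] ← (1 − α)/((α^{T+1}) N) used in the algorithm for Truncated PageRank.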
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 … N do    {Initialization}
    R[i] ← (1 − α)/((α^{T+1}) N)
    if T ≥ 0 then
        Score[i] ← 0
    else    {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 … N do    {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 … N do    {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then    {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
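A direct Python transcription of the pseudocode might look as follows. The convergence test (stop once every entry of R falls below `tol`), the `max_iter` guard, and the requirement that every node has at least one out-link are assumptions of this sketch; the slides only say "while not converged".

```python
def truncated_pagerank(out_links, alpha=0.85, T=1, tol=1e-12, max_iter=1000):
    """Truncated PageRank; T = -1 yields the normal PageRank scores.

    out_links[i] lists the nodes that node i links to (each list non-empty).
    """
    n = len(out_links)
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    while distance <= max_iter:
        aux = [0.0] * n
        for src, dests in enumerate(out_links):   # follow links in the graph
            share = r[src] / len(dests)
            for dest in dests:
                aux[dest] += share
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:
            for i in range(n):
                score[i] += r[i]                  # add to ranking value
            if max(r) < tol:                      # assumed stopping rule
                break
        distance += 1
    return score
```

Since the scores for several values of T differ only in which terms are accumulated, they can be collected in the same pass over the graph, which is the point of "no extra reading of the graph after PageRank".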
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank against normal PageRank, one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010); bits are propagated along the links using the "OR" operation; at the target page, the bits set are counted to estimate supporters]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011). Bits are propagated along the links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do    // Iteration step
    Aux ← 0_k
    for src : 1 ... N do    // Follow links in the graph
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do    // Estimate supporters
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
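The propagation step of this algorithm can be sketched in Python, using integers as k-bit vectors. This is a minimal sketch under stated assumptions: the adjacency-list input `out_links`, the function name `propagate_bits`, and the choice of INIT (each bit set independently with probability `eps`, as on the estimator slide) are illustrative, and ESTIMATE is left out so that only the bit vectors are returned.

```python
import random


def propagate_bits(out_links, d, k, eps, seed=0):
    """Return, for every node, the k-bit vector reaching it after d steps.

    out_links[src] lists the nodes that src links to (0-based ids).
    Bit vectors are stored as Python integers.
    """
    rng = random.Random(seed)
    n = len(out_links)
    # INIT: each of the k bits of every node is 1 with probability eps.
    v = [sum(1 << b for b in range(k) if rng.random() < eps) for _ in range(n)]
    for _ in range(d):
        aux = [0] * n  # Aux starts as the all-zero vector for every node
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]  # bitwise OR of the source's bits
        v = aux
    return v


def count_ones(bits):
    """Number of bits set in a bit vector."""
    return bin(bits).count("1")


# A chain 0 -> 1 -> 2. With eps = 1.0 every bit is set at initialization,
# which makes the propagation easy to follow by hand.
chain = [[1], [2], []]
after_one_step = propagate_bits(chain, d=1, k=8, eps=1.0)
after_two_steps = propagate_bits(chain, d=2, k=8, eps=1.0)
```

Because Aux is reset to zero in every iteration, a node with no in-links ends up with an empty vector, as node 0 does in the chain above.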
Our estimator
Initialize all bits to one with probability ε.
By the independence of the i-th components X_i, we have
$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$
Estimator: $\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
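The estimator on this slide inverts the probability that a bit is set. A minimal Python sketch follows; the function name `estimate_neighbors` and the choice to return infinity when every bit is set are assumptions made for illustration, not part of the slides.

```python
import math


def estimate_neighbors(ones, k, eps):
    """Estimate neighbors(node) from the number of bits set.

    Solves ones/k = 1 - (1 - eps)^n for n, i.e.
    n = log(1 - ones/k) / log(1 - eps).
    """
    if ones >= k:
        # Saturated vector: the logarithm is undefined (assumed convention).
        return math.inf
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)


# With 10 neighbors, eps = 0.1 and k = 1000, about 651 bits are expected
# to be set, because 1000 * (1 - 0.9**10) is roughly 651.3.
estimate = estimate_neighbors(651, 1000, 0.1)
saturated = estimate_neighbors(1000, 1000, 0.1)
```

The saturated case illustrates the problem named above: with a fixed ε, a node with very many neighbors has all k bits set, and no finite estimate can be recovered.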
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
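The search over ε can be sketched as below. This is a hedged illustration, not the authors' procedure: `ones_for` is a hypothetical callable standing in for a full bit-propagation run at a given ε, and returning 1/ε at the first ε that falls below the threshold is an assumed way of turning the transition into a number.

```python
import math


def adaptive_estimate(ones_for, k, max_halvings=40):
    """Halve eps until fewer than (1 - 1/e) * k bits are set.

    ones_for(eps) must return the number of bits set for the node when the
    bits were initialized with probability eps.
    Returns 1/eps at the transition, or None if no transition is found.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    eps = 0.5
    for _ in range(max_halvings):
        if ones_for(eps) < threshold:
            return 1.0 / eps
        eps /= 2.0
    return None


# Stand-in for real runs: the expected number of ones for a node that
# truly has 100 neighbors, with k = 1000 bits.
k = 1000
true_neighbors = 100
expected_ones = lambda eps: round(k * (1.0 - (1.0 - eps) ** true_neighbors))
estimate = adaptive_estimate(expected_ones, k)
```

In this example the count of ones drops below the threshold between ε = 1/64 and ε = 1/128, which brackets the true value of 100 neighbors.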
Convergence
[Figure: fraction of nodes with estimates (0 to 100) against the iteration number (5 to 20), with one curve per distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, and Supporters at 2, ..., 4.
We measured:
Precision = (# of spam hosts classified as spam) / (# of hosts classified as spam)
Recall = (# of spam hosts classified as spam) / (# of spam hosts)
and the two types of errors in spam classification:
False positive rate = (# of normal hosts classified as spam) / (# of normal hosts)
False negative rate = (# of spam hosts classified as normal) / (# of spam hosts)
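The four measures above can be computed from two parallel lists of labels. The following is a small illustrative sketch; the function name `spam_metrics`, the boolean encoding (True means spam), and returning 0.0 for an empty denominator are assumptions.

```python
def spam_metrics(actual_spam, predicted_spam):
    """Precision, recall and both error rates for a spam classifier.

    actual_spam[i] and predicted_spam[i] are truthy when host i is
    (respectively, is classified as) spam.
    """
    pairs = list(zip(actual_spam, predicted_spam))
    tp = sum(1 for a, p in pairs if a and p)          # spam classified as spam
    fp = sum(1 for a, p in pairs if not a and p)      # normal classified as spam
    fn = sum(1 for a, p in pairs if a and not p)      # spam classified as normal
    tn = sum(1 for a, p in pairs if not a and not p)  # normal classified as normal

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, tp + fn),
    }


# Ten hosts: four spam, six normal; the classifier flags three hosts.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = spam_metrics(actual, predicted)
```

Since recall and the false negative rate share the denominator "# of spam hosts", they always add up to one.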
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
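As an illustration of how such a feature set could be assembled for one page, here is a hedged sketch for the Truncated PageRank classifier. The function name, the dictionary keys, and the handling of a zero PageRank are all hypothetical; the slides do not specify a feature encoding.

```python
def truncated_pagerank_features(pagerank, truncated, distances=(2, 3, 4)):
    """Build the feature dictionary for one page.

    pagerank is the page's PageRank; truncated[t] is its Truncated PageRank
    at truncation distance t.
    """
    features = {"pagerank": pagerank}
    for t in distances:
        features[f"tpr_{t}"] = truncated[t]
        # Ratio feature; 0.0 is an assumed fallback when PageRank is zero.
        features[f"tpr_{t}_over_pr"] = truncated[t] / pagerank if pagerank > 0 else 0.0
    return features


example = truncated_pagerank_features(0.5, {2: 0.25, 3: 0.125, 4: 0.0625})
```

The same shape would apply to the supporters-based classifier, with the estimated supporters at distance d in place of the Truncated PageRank values.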
Comparison of single-technique classifier (M=5)
Classifiers                 Spam class         False    False
(pruning with M = 5)        Prec.    Recall    Pos.     Neg.
TrustRank                   0.82     0.50      2.1%     50%
Trunc. PageRank t = 2       0.85     0.50      1.6%     50%
Trunc. PageRank t = 3       0.84     0.47      1.6%     53%
Trunc. PageRank t = 4       0.79     0.45      2.2%     55%
Est. Supporters d = 2       0.78     0.60      3.2%     40%
Est. Supporters d = 3       0.83     0.64      2.4%     36%
Est. Supporters d = 4       0.86     0.64      2.0%     36%
Comparison of single-technique classifier (M=30)
Classifiers                 Spam class         False    False
(pruning with M = 30)       Prec.    Recall    Pos.     Neg.
TrustRank                   0.80     0.49      2.3%     51%
Trunc. PageRank t = 2       0.82     0.43      1.8%     57%
Trunc. PageRank t = 3       0.81     0.42      1.8%     58%
Trunc. PageRank t = 4       0.77     0.43      2.4%     57%
Est. Supporters d = 2       0.76     0.52      3.1%     48%
Est. Supporters d = 3       0.83     0.57      2.1%     43%
Est. Supporters d = 4       0.80     0.57      2.6%     43%
Combined classifier
                            Spam class             False    False
Pruning     Rules           Precision    Recall    Pos.     Neg.
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm to compute a scoring vector W of the form
$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
PageRank itself corresponds to exponential damping:
$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$
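The sum above can be evaluated term by term by repeatedly pushing a uniform vector along the links. The sketch below is illustrative only: the adjacency-list input `out_links`, the truncation of the infinite sum at `max_t`, and the helper `exponential_damping` are assumptions, and dangling nodes are not treated specially.

```python
def functional_ranking(out_links, damping, max_t=200):
    """Approximate W = sum_t damping(t)/N * P^t over t = 0 .. max_t.

    out_links[src] lists the nodes that src links to (0-based ids);
    damping is any function of the path length t.
    """
    n = len(out_links)
    vec = [1.0 / n] * n  # the term (1/N) * P^t for t = 0
    w = [0.0] * n
    for t in range(max_t + 1):
        coeff = damping(t)
        w = [wi + coeff * vi for wi, vi in zip(w, vec)]
        # Advance one step: spread each node's value over its out-links.
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                nxt[dest] += vec[src] / len(dests)
        vec = nxt
    return w


def exponential_damping(alpha):
    """The PageRank damping function (1 - alpha) * alpha^t."""
    return lambda t: (1.0 - alpha) * alpha ** t


# On a 3-node cycle every page is equivalent, so each score is 1/3.
cycle = [[1], [2], [0]]
scores = functional_ranking(cycle, exponential_damping(0.85))
```

Passing a different `damping` function changes the ranking without touching the propagation loop, which is the point of the functional-ranking formulation.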
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^t & t > T \end{cases}$
✓ No extra reading of the graph after PageRank
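The slide does not give the value of C. A short derivation, assuming C is a normalization constant chosen so that the damping coefficients sum to one (as they do for PageRank):

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

Under that assumption, C/N equals the value (1 − α)/(α^(T+1) N) used to initialize R[i] in the general algorithm, and T = −1 gives C = 1 − α, which recovers the PageRank damping (1 − α)α^t.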
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczur, A. A., Csalogany, K., Sarlos, T., and Uher, M. (2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyongyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyongyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with trustrank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Motivation
Spam pages characterization
Truncated PageRank
Counting supporters
Experiments
Conclusions
General functional ranking
Let P be the row-normalized version of the citation matrix of a graph G = (V, E).
A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm that computes a scoring vector W of the form
$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$$
There are many choices for damping(t), including simply a linear function that is as good as PageRank in practice.
$$\mathrm{damping}(t) = (1 - \alpha)\,\alpha^t$$
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
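The slide does not say how the constant C is fixed. One natural reading, stated here as an assumption, is that C is chosen so that the damping weights still sum to one, as they do for the untruncated weights (1 − α)α^t. Under that assumption a short derivation gives:

```latex
% Assumption: the truncated damping function is normalized to sum to one.
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}} .
% Special case T = -1 (no truncation): C = 1 - \alpha, i.e. damping(t) = (1-\alpha)\alpha^{t}.
```

With T = −1 nothing is truncated and the weights reduce to (1 − α)α^t, which appears consistent with the algorithm allowing T ≥ −1 for the normal PageRank case.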
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
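The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the adjacency-list representation `out_links`, the function name `truncated_pagerank`, the stopping rule (`tol`, `max_iter`) and the choice to drop the mass of nodes without out-links are all assumptions made for the example.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, max_iter=200, tol=1e-10):
    """Truncated PageRank on a graph given as adjacency lists.

    out_links[i] is the list of nodes that node i links to (assumed format).
    T = -1 disables truncation and yields normal PageRank-style scores.
    """
    n = len(out_links)
    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N)
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    for _ in range(max_iter):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):  # follow links in the graph
            if not dests:
                continue  # assumption: mass of dangling nodes is dropped
            share = r[src] / len(dests)
            for dest in dests:
                aux[dest] += share
        r = [alpha * a for a in aux]  # apply damping factor alpha
        if distance > T:  # only paths longer than T links contribute
            score = [s + x for s, x in zip(score, r)]
        distance += 1
        if sum(r) < tol:  # assumed convergence test: remaining mass is negligible
            break
    return score


# A small "star": nodes 1, 2, 3 link to node 0, which has no other support.
star = [[], [0], [0], [0]]
print(truncated_pagerank(star, T=-1)[0])  # positive: direct in-links count
print(truncated_pagerank(star, T=1)[0])   # 0.0: all support is at distance 1
```

The star example shows the intended effect: a page whose score comes only from direct in-links keeps a positive score under normal PageRank but gets nothing once the first level of links is truncated.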
Truncated PageRank vs PageRank
[Figure: two plots of Truncated PageRank (y-axis, logarithmic scale) against normal PageRank (x-axis, logarithmic scale), one for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010); the bit vectors are propagated along the links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
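As a rough sketch of the propagation step above, the bit vectors can be stored as Python integers and combined with the bitwise OR operator. The function names `init_bits`, `propagate_bits` and `count_ones`, the adjacency-list format and the random initialization with probability `epsilon` are assumptions for illustration; the sketch mirrors the pseudocode literally, so after d iterations a node holds the OR of the vectors that reach it in exactly d link-following steps.

```python
import random


def init_bits(n, k, epsilon, rng):
    """INIT step (assumed form): set each of the k bits to one with probability epsilon."""
    return [sum((rng.random() < epsilon) << b for b in range(k)) for _ in range(n)]


def propagate_bits(out_links, bits, d):
    """Run d iterations of OR-propagation along the links, as in the pseudocode."""
    v = list(bits)
    for _ in range(d):
        aux = [0] * len(v)  # Aux <- 0^k for every node
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]  # Aux[dest] <- Aux[dest] OR V[src]
        v = aux
    return v


def count_ones(vector):
    """Number of bits set in a bit vector stored as an int."""
    return bin(vector).count("1")


# Two pages (0 and 1) link to a target page (2).
out_links = [[2], [2], []]
bits = [0b100010, 0b000110, 0b110000]
after = propagate_bits(out_links, bits, d=1)
print(format(after[2], "06b"), count_ones(after[2]))
```

Storing each vector in one integer keeps the OR step to a single machine-level operation per link, which is the main reason this style of counting scales to large graphs.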
Our estimator
Initialize all bits to one with probability ε.
By the independence of the i-th components X_i, we have
$$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$$
Estimator:
$$\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
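The estimator can be written as a few lines of Python. This is only a sketch: the function name `estimate_neighbors` and the decision to return infinity when every bit is set are assumptions made for the example, not something the slides specify.

```python
import math


def estimate_neighbors(ones, k, epsilon):
    """Invert P[X_i = 1] = 1 - (1 - epsilon)^n using the observed fraction ones / k."""
    if ones >= k:
        # Saturated vector: the logarithm is undefined, so no finite estimate exists.
        return math.inf
    return math.log(1.0 - ones / k) / math.log(1.0 - epsilon)


# With epsilon = 0.5 and 2 neighbors, the expected fraction of ones is 0.75.
print(estimate_neighbors(ones=48, k=64, epsilon=0.5))
print(estimate_neighbors(ones=64, k=64, epsilon=0.5))
```

The second call illustrates the problem noted above: once ones(node) reaches k the estimate carries no information, and when ones(node) is 0 the estimate collapses to zero.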
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
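The 0.63k figure follows from the bit probability of the estimator. A short worked step, writing n for neighbors(node) (a shorthand introduced here) and assuming the k bits are set independently:

```latex
% n = neighbors(node), \epsilon = 1/n, bits assumed independent
P[X_i = 1] = 1 - \left(1 - \tfrac{1}{n}\right)^{n}
  \;\xrightarrow{\,n \to \infty\,}\; 1 - \tfrac{1}{e} \approx 0.632,
\qquad
E[\mathrm{ones}(node)] = k \cdot P[X_i = 1] \approx 0.63\,k .
```

Because 1 − (1 − ε)^n grows with ε, a value of ε above 1/n tends to give more than (1 − 1/e)k ones and a value below 1/n tends to give fewer, so the transition roughly locates n between two consecutive values of 1/ε.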
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis, 5 to 20), with one curve per distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 to 4, Supporters at 2 to 4.
We measured:
$$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$
$$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$
and the two types of errors in spam classification:
$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$
$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
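The four measures can be computed from per-host labels as in the sketch below. The function name `spam_metrics`, the boolean encoding (True means spam) and the small example data are assumptions for illustration only; they are not taken from the experiments.

```python
def spam_metrics(is_spam, classified_as_spam):
    """Precision, recall and the two error rates for a spam classifier.

    Both arguments are equal-length sequences of booleans, one entry per host.
    """
    tp = fp = tn = fn = 0
    for truth, flagged in zip(is_spam, classified_as_spam):
        if truth and flagged:
            tp += 1  # spam host classified as spam
        elif truth:
            fn += 1  # spam host classified as normal
        elif flagged:
            fp += 1  # normal host classified as spam
        else:
            tn += 1  # normal host classified as normal

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Hypothetical sample: 4 spam hosts and 6 normal hosts.
truth = [True] * 4 + [False] * 6
flagged = [True, True, True, False, True, False, False, False, False, False]
print(spam_metrics(truth, flagged))
```

Note that the false negative rate is always one minus the recall, since both are measured over the spam hosts.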
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers (pruning with M = 5)   Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.82               0.50                2.1%         50%
Trunc. PageRank t = 2              0.85               0.50                1.6%         50%
Trunc. PageRank t = 3              0.84               0.47                1.6%         53%
Trunc. PageRank t = 4              0.79               0.45                2.2%         55%
Est. Supporters d = 2              0.78               0.60                3.2%         40%
Est. Supporters d = 3              0.83               0.64                2.4%         36%
Est. Supporters d = 4              0.86               0.64                2.0%         36%
Comparison of single-technique classifier (M=30)
Classifiers (pruning with M = 30)  Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.80               0.49                2.3%         51%
Trunc. PageRank t = 2              0.82               0.43                1.8%         57%
Trunc. PageRank t = 3              0.81               0.42                1.8%         58%
Trunc. PageRank t = 4              0.77               0.43                2.4%         57%
Est. Supporters d = 2              0.76               0.52                3.1%         48%
Est. Supporters d = 3              0.83               0.57                2.1%         43%
Est. Supporters d = 4              0.80               0.57                2.6%         43%
Combined classifier
Pruning   Rules   Spam class Precision   Spam class Recall   False Pos.   False Neg.
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (vertical axis) against normal PageRank (horizontal axis), one panel for T = 1 and one for T = 4.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector; bits are propagated along links to the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
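The propagation step above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code. Each k-bit vector is stored as a Python integer, `out_links` is an adjacency list, and INIT is taken to set each bit independently with probability `epsilon` (the `seed` parameter is added only for reproducibility). The ESTIMATE step is left out, so the function returns the propagated bit vectors.

```python
import random

def propagate_bits(out_links, d, k, epsilon, seed=0):
    """Propagate k-bit vectors along links for d iterations using OR."""
    rng = random.Random(seed)
    n = len(out_links)
    # INIT: set each of the k bits of every node with probability epsilon.
    v = []
    for _ in range(n):
        bits = 0
        for b in range(k):
            if rng.random() < epsilon:
                bits |= 1 << b
        v.append(bits)
    for _ in range(d):
        # Aux starts at zero, as in the pseudocode, so after the loop a node
        # holds the OR of the bits that reached it over walks of exactly
        # the current number of links.
        aux = [0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:
                aux[dest] |= v[src]
        v = aux
    return v
```

The number of ones in a node's vector, for example `bin(v[node]).count("1")`, is the quantity that the estimator works with.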
Our estimator
Initialize each bit to one, independently, with probability ε.
By the independence of the i-th components X_i, we have
$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$
Estimator:
$\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
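As a small illustration, the estimator can be written as a Python function. The name `estimate_neighbors` and the decision to return infinity when every bit is set are assumptions made for this sketch. The saturated case is exactly the problem described above, where the formula no longer gives a usable value.

```python
import math

def estimate_neighbors(ones, k, epsilon):
    """Invert P[X_i = 1] = 1 - (1 - epsilon)^neighbors for a count of set bits."""
    if ones >= k:
        # All bits are set: the logarithm is undefined (assumed handling).
        return math.inf
    # log base (1 - epsilon) of (1 - ones / k), via a change of base.
    return math.log(1.0 - ones / k) / math.log(1.0 - epsilon)
```

Feeding in the expected number of ones for a node with 10 neighbors and ε = 0.1 recovers a value close to 10, while a node with thousands of neighbors would saturate all k bits at the same ε.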
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
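The search for the transition can be sketched as below. This is a simplified illustration, not the authors' procedure. The callback `ones_for` (returning the number of set bits observed for a given ε), the cap `max_halvings`, and the choice to report 1/ε at the first ε below the threshold are assumptions. The helper `expected_ones` replaces a real propagation run with the expected count, so the example is deterministic.

```python
import math

def adaptive_estimate(ones_for, k, max_halvings=40):
    """Halve epsilon until fewer than (1 - 1/e) * k bits are set."""
    threshold = (1.0 - 1.0 / math.e) * k
    epsilon = 0.5
    for _ in range(max_halvings):
        if ones_for(epsilon) < threshold:
            # Transition found: epsilon is now on the order of 1 / neighbors.
            return 1.0 / epsilon
        epsilon /= 2.0
    return None  # no transition within the allowed number of halvings

def expected_ones(neighbors, k):
    """Idealized stand-in: expected number of ones for a given epsilon."""
    return lambda eps: k * (1.0 - (1.0 - eps) ** neighbors)
```

In this idealized setting, `adaptive_estimate(expected_ones(100, 64), 64)` stops at ε = 1/128. Because ε is halved at each step, the reported value is only accurate to within roughly a factor of two.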
Convergence
[Figure: fraction of nodes with estimates (vertical axis, 0 to 100) against iteration (horizontal axis, 5 to 20), with one curve for each distance d = 1, ..., 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$
$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$
and the two types of errors in spam classification:
$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$
$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
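The four measures can be computed from per-host labels as in the following sketch. The function name `spam_metrics`, the use of booleans (True meaning spam), and the handling of empty denominators are assumptions made for the example.

```python
def spam_metrics(is_spam, predicted_spam):
    """Compute precision, recall and both error rates from per-host booleans."""
    pairs = list(zip(is_spam, predicted_spam))
    tp = sum(1 for y, p in pairs if y and p)          # spam classified as spam
    fp = sum(1 for y, p in pairs if not y and p)      # normal classified as spam
    fn = sum(1 for y, p in pairs if y and not p)      # spam classified as normal
    tn = sum(1 for y, p in pairs if not y and not p)  # normal classified as normal

    def ratio(num, den):
        # Assumed convention: an empty denominator yields 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

Recall and the false negative rate share the same denominator, so they always sum to one whenever at least one spam host is present.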
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M = 5)
Classifiers                  Spam class         False    False
(pruning with M = 5)         Prec.    Recall    Pos.     Neg.
TrustRank                    0.82     0.50      2.1%     50%
Trunc. PageRank t = 2        0.85     0.50      1.6%     50%
Trunc. PageRank t = 3        0.84     0.47      1.6%     53%
Trunc. PageRank t = 4        0.79     0.45      2.2%     55%
Est. Supporters d = 2        0.78     0.60      3.2%     40%
Est. Supporters d = 3        0.83     0.64      2.4%     36%
Est. Supporters d = 4        0.86     0.64      2.0%     36%
Comparison of single-technique classifiers (M = 30)
Classifiers                  Spam class         False    False
(pruning with M = 30)        Prec.    Recall    Pos.     Neg.
TrustRank                    0.80     0.49      2.3%     51%
Trunc. PageRank t = 2        0.82     0.43      1.8%     57%
Trunc. PageRank t = 3        0.81     0.42      1.8%     58%
Trunc. PageRank t = 4        0.77     0.43      2.4%     57%
Est. Supporters d = 2        0.76     0.52      3.1%     48%
Est. Supporters d = 3        0.83     0.57      2.1%     43%
Est. Supporters d = 4        0.80     0.57      2.6%     43%
Combined classifier
Pruning    Rules    Spam class Precision    Spam class Recall    False Pos.    False Neg.
Conclusions
✓ Link-based statistics detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
\[ \text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}} \]
\[ \text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}} \]
and the two types of errors in spam classification:
\[ \text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}} \]
\[ \text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}} \]
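The four measures above can be computed from the counts of a confusion matrix. The following is a minimal sketch; the function name `spam_metrics` and the example counts are hypothetical and chosen only for illustration, not taken from the experiments.

```python
def spam_metrics(tp, fp, fn, tn):
    """Return (precision, recall, false positive rate, false negative rate).

    tp: spam hosts classified as spam
    fp: normal hosts classified as spam
    fn: spam hosts classified as normal
    tn: normal hosts classified as normal
    """
    precision = tp / (tp + fp)            # among hosts classified as spam
    recall = tp / (tp + fn)               # among spam hosts
    false_positive_rate = fp / (fp + tn)  # among normal hosts
    false_negative_rate = fn / (tp + fn)  # among spam hosts
    return precision, recall, false_positive_rate, false_negative_rate


# Hypothetical counts, for illustration only.
precision, recall, fpr, fnr = spam_metrics(tp=50, fp=10, fn=50, tn=490)
```

Recall and the false negative rate share the same denominator (the number of spam hosts), so they always add up to one.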
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M = 5)
Classifiers                Spam class  Spam class  False   False
(pruning with M = 5)       Prec.       Recall      Pos.    Neg.
TrustRank                  0.82        0.50        2.1%    50%
Trunc. PageRank t = 2      0.85        0.50        1.6%    50%
Trunc. PageRank t = 3      0.84        0.47        1.6%    53%
Trunc. PageRank t = 4      0.79        0.45        2.2%    55%
Est. Supporters d = 2      0.78        0.60        3.2%    40%
Est. Supporters d = 3      0.83        0.64        2.4%    36%
Est. Supporters d = 4      0.86        0.64        2.0%    36%
Comparison of single-technique classifiers (M = 30)
Classifiers                Spam class  Spam class  False   False
(pruning with M = 30)      Prec.       Recall      Pos.    Neg.
TrustRank                  0.80        0.49        2.3%    51%
Trunc. PageRank t = 2      0.82        0.43        1.8%    57%
Trunc. PageRank t = 3      0.81        0.42        1.8%    58%
Trunc. PageRank t = 4      0.77        0.43        2.4%    57%
Est. Supporters d = 2      0.76        0.52        3.1%    48%
Est. Supporters d = 3      0.83        0.57        2.1%    43%
Est. Supporters d = 4      0.80        0.57        2.6%    43%
Combined classifier
Pruning   Rules   Precision (spam class)   Recall (spam class)   False Pos.   False Neg.
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Motivation
Spam pages characterization
Truncated PageRank
Counting supporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
\[ \text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases} \]
✓ No extra reading of the graph after PageRank
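The constant C is not spelled out on this slide. Assuming the damping weights are normalized to sum to one, as they do for PageRank's damping(t) = (1 − α)α^t, its value follows from a geometric series. This is a worked derivation under that assumption; the result agrees with the initial value of R[i] on the algorithm slide that follows.

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha}
  = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

With T = −1 this gives C = 1 − α, which recovers the ordinary PageRank damping.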
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
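The pseudocode above can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the graph is given as a list of out-link lists (`out_links` is a hypothetical name), and a fixed number of iterations replaces the unspecified convergence test.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, iterations=50):
    """Truncated PageRank scores; out_links[i] lists the nodes that i links to.

    Assumption: a fixed iteration count stands in for "while not converged".
    With T = -1 the result is ordinary PageRank.
    """
    n = len(out_links)
    # Initial mass per node: (1 - alpha) / (alpha^(T+1) * N).
    rank = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(rank)
    for distance in range(1, iterations + 1):
        aux = [0.0] * n
        for src, dests in enumerate(out_links):  # follow links in the graph
            for dest in dests:
                aux[dest] += rank[src] / len(dests)
        rank = [alpha * a for a in aux]          # apply damping factor
        if distance > T:                         # skip the first T levels
            score = [s + r for s, r in zip(score, rank)]
    return score


# Node 0 is supported only by four direct in-links (nodes 1 to 4).
star = [[], [0], [0], [0], [0]]
pagerank = truncated_pagerank(star, T=-1)
truncated = truncated_pagerank(star, T=1)
```

In this toy graph node 0 has the highest PageRank, but its Truncated PageRank with T = 1 is zero, because all of its support comes from distance 1.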
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (vertical axis, 10^-9 to 10^-3) against normal PageRank (horizontal axis, 10^-8 to 10^-4), one for T = 1 and one for T = 4]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: pages holding 6-bit vectors (100010, 110000, 000110, 000011, ...); bits are propagated along links using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
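The propagation step above can be sketched in Python with each bit vector stored as an integer. This is a minimal illustration, not the authors' code: `propagate_bits` and `out_links` are hypothetical names, INIT is assumed to set each bit to one independently with probability `epsilon` (as described on the estimator slide), and ESTIMATE is left out, so the function returns the raw bit vectors.

```python
import random


def propagate_bits(out_links, d, k, epsilon, seed=0):
    """Propagate k-bit vectors along links for d steps using OR.

    out_links[i] lists the nodes that node i links to.
    Assumption: INIT sets each bit with probability epsilon.
    """
    rng = random.Random(seed)
    n = len(out_links)
    vectors = [
        sum((rng.random() < epsilon) << bit for bit in range(k))
        for _ in range(n)
    ]
    for _ in range(d):                      # iteration step
        aux = [0] * n
        for src, dests in enumerate(out_links):
            for dest in dests:              # follow links in the graph
                aux[dest] |= vectors[src]
        vectors = aux                       # as in the pseudocode: V <- Aux
    return vectors


# Chain 0 -> 1 -> 2; epsilon = 1.0 sets every bit, which makes the run deterministic.
chain = [[1], [2], []]
after_one = propagate_bits(chain, d=1, k=4, epsilon=1.0)
after_two = propagate_bits(chain, d=2, k=4, epsilon=1.0)
ones_at_target = bin(after_two[2]).count("1")
```

Following the pseudocode literally, a node's own bits are replaced at each step, so after d steps a vector reflects the pages that reach the node through paths of exactly d links.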
Our estimator
Initialize all bits to one with probability ε.
By the independence of the i-th components X_i, we have
\[ P[X_i = 1] = 1 - (1 - \epsilon)^{\text{neighbors}(node)} \]
Estimator:
\[ \text{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\text{ones}(node)}{k}\right) \]
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
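The estimator above can be written as a small Python function. This is a sketch under stated assumptions: the name `estimate_neighbors` is hypothetical, and returning infinity when every bit is set is a choice made here to expose the saturation problem, not something the slides prescribe.

```python
import math


def estimate_neighbors(ones, k, epsilon):
    """Invert ones/k = 1 - (1 - epsilon)^n to recover n.

    ones: number of bits set at the node, k: vector length,
    epsilon: probability with which each bit was initialized to one.
    """
    if ones >= k:
        # Saturated vector: the logarithm is undefined (assumed convention).
        return math.inf
    return math.log(1 - ones / k) / math.log(1 - epsilon)


# Round trip on the expected number of ones for n = 10 neighbors.
k, epsilon, n = 1000, 0.1, 10
expected_ones = k * (1 - (1 - epsilon) ** n)
estimate = estimate_neighbors(expected_ones, k, epsilon)
```

When no bit is set the estimate is 0, and when all k bits are set no finite estimate exists, which are the two degenerate cases the slide warns about.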
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
\[ \text{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k \]
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
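The search for the transition can be sketched as follows. This is only an illustration under assumptions: `adaptive_estimate` and `count_ones` are hypothetical names, and reading the estimate off as 1/ε at the first value of ε that falls below the threshold is a crude stand-in, since the slide does not say how the final number is derived.

```python
import math


def adaptive_estimate(count_ones, k, max_halvings=32):
    """Halve epsilon until fewer than (1 - 1/e) * k bits are set.

    count_ones(epsilon) returns the number of ones observed at the node
    when bits are initialized with probability epsilon.
    Assumption: 1/epsilon at the transition is used as the estimate.
    """
    threshold = (1 - 1 / math.e) * k
    for j in range(1, max_halvings + 1):
        epsilon = 2.0 ** -j
        if count_ones(epsilon) < threshold:
            return 1 / epsilon
    return None  # no transition found within max_halvings


# Deterministic example: use the expected number of ones for 100 neighbors.
k, true_neighbors = 64, 100
expected_ones = lambda eps: k * (1 - (1 - eps) ** true_neighbors)
estimate = adaptive_estimate(expected_ones, k)
```

Because ε only takes powers of 1/2, this reading locates the number of neighbors to within roughly a factor of two.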
Truncated PageRank
Reduce the direct contribution of the first levels of links:
$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
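The slide does not spell out the constant C. One reading, assuming the damping weights are normalized to sum to one (as they do for the PageRank damping (1 − α)α^t), is sketched below; it agrees with the initial value (1 − α)/(α^(T+1) N) used in the general algorithm.

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

With T = −1 this gives C = 1 − α, which is the ordinary PageRank damping.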
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 … N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 … N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 … N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
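The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the adjacency-list input `out_links`, the default parameter values, and the stopping rule (stop once the mass added in one step falls below `tol`) are all assumptions, since the slide only says "while not converged".

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    """Truncated PageRank following the slide's pseudocode.

    out_links[src] is the list of pages that src links to (assumed format).
    T = -1 gives ordinary PageRank.
    """
    n = len(out_links)
    # Initial rank carries the normalisation (1 - alpha) / alpha^(T+1).
    r = [(1.0 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)
    distance = 1
    while distance <= max_iter:
        aux = [0.0] * n
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] += r[src] / len(dests)
        r = [alpha * a for a in aux]              # apply damping factor alpha
        if distance > T:                          # add to ranking value
            score = [s + x for s, x in zip(score, r)]
            if sum(r) < tol:                      # assumed stopping rule
                break
        distance += 1
    return score
```

For a page whose only supporters are direct in-links, for example `truncated_pagerank([[], [0], [0]], T=1)`, the score of page 0 is zero, while `T=-1` gives it a positive PageRank. That is the effect the truncation is meant to have on the first levels of links.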
Truncated PageRank vs PageRank
[Figure: two panels, T = 1 and T = 4, plotting Truncated PageRank (vertical axis, 10^-9 to 10^-3) against normal PageRank (horizontal axis, 10^-8 to 10^-4) on logarithmic scales.]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (e.g. 100010, 110000, 000110, 000011); bits are propagated along links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]
Improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 … N, bit : 1 … k do
    INIT(node, bit)
end for
for distance : 1 … d do {Iteration step}
    Aux ← 0_k
    for src : 1 … N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 … N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
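The propagation step can be sketched in Python by storing each k-bit vector as an integer. This is an illustrative sketch only: the names `propagate_bits` and `count_ones` and the adjacency-list input are assumptions, and the initial bit vectors are passed in explicitly rather than drawn at random, so that the behaviour is easy to check by hand.

```python
def propagate_bits(out_links, init_bits, d):
    """OR-propagate bit vectors along links for d steps, as in the pseudocode.

    out_links[src] lists the pages src links to; init_bits[node] is an int
    used as a k-bit vector (both are assumed input formats).
    """
    v = list(init_bits)
    for _ in range(d):
        aux = [0] * len(v)
        for src, dests in enumerate(out_links):   # follow links in the graph
            for dest in dests:
                aux[dest] |= v[src]   # Aux[dest] <- Aux[dest] OR V[src, .]
        v = aux                       # V <- Aux, exactly as in the pseudocode
    return v


def count_ones(mask):
    """Number of bits set in a vector, i.e. ones(node)."""
    return bin(mask).count("1")
```

For two pages that both link to page 0, `propagate_bits([[], [0], [0]], [0b001, 0b010, 0b100], 1)` leaves page 0 with the vector `0b110`, so `count_ones` reports two bits set. Because the pseudocode overwrites V with Aux, a page's vector after d steps reflects pages reachable by walks of exactly d links; also OR-ing in the previous vector to accumulate all shorter distances would be a variant not shown on the slide.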
Our estimator
Initialize each bit to one with probability ε.
By the independence of the i-th components X_i, we have
$$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$$
Estimator:
$$\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
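As a small worked illustration of the estimator, the sketch below inverts the expression for P[X_i = 1] using the observed fraction ones(node)/k. The function name and the choice to return `None` for a saturated vector (ones = k, where the logarithm is undefined) are assumptions made for this example.

```python
import math


def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)^n using the observed fraction ones/k.

    Returns None when all k bits are set, because log(0) is undefined;
    that return convention is an assumption of this sketch.
    """
    if ones >= k:
        return None
    # log base (1 - eps) of (1 - ones/k), written with natural logarithms.
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)
```

With ε = 1/2 and k = 64, observing 48 ones gives log(0.25)/log(0.5) = 2 neighbors. A vector with all 64 bits set carries no usable information at that ε, which is the saturation problem described above.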
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, … and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
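One way to read "look for the transition" is sketched below. It assumes the counts of ones have already been collected for ε = 1/2, 1/4, 1/8, … and reports 1/ε of the last round that still had at least (1 − 1/e)k ones. Both the function name and this particular decision rule are assumptions for illustration, not the authors' exact estimator.

```python
import math


def adaptive_estimate(ones_by_round, k):
    """Rough supporter estimate from counts observed with eps = 1/2, 1/4, ...

    ones_by_round[j] is ones(node) measured with eps = 2 ** -(j + 1).
    Returns 1/eps of the last round before the count drops below
    (1 - 1/e) * k, or 1 if the very first round is already below it.
    Returns None if no transition is seen in the rounds supplied.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for j, ones in enumerate(ones_by_round):
        if ones < threshold:
            return 2 ** j   # equals 1/eps of round j - 1 (and 1 when j == 0)
    return None
```

For k = 64 the threshold is about 40.5 ones. The made-up counts `[64, 63, 56, 41, 25]` cross it between ε = 1/16 and ε = 1/32, so the sketch returns 16.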
Convergence
[Figure: fraction of nodes with estimates (vertical axis, 0 to 100) against iteration (horizontal axis, 5 to 20), one curve per distance d = 1, …, 8.]
15 iterations for estimating the neighbors at distance 4 or less.
Less than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank) PageRank, Truncated PageRank at 2 … 4, Supporters at 2 … 4.
We measured:
Precision = (# of spam hosts classified as spam) / (# of hosts classified as spam)
Recall = (# of spam hosts classified as spam) / (# of spam hosts)
and the two types of errors in spam classification:
False positive rate = (# of normal hosts classified as spam) / (# of normal hosts)
False negative rate = (# of spam hosts classified as normal) / (# of spam hosts)
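The four measures can be computed from the counts of a confusion matrix, as in the short sketch below. The function name and argument names are hypothetical, the counts in the usage note are made up for illustration, and all denominators are assumed to be non-zero.

```python
def spam_metrics(tp, fp, tn, fn):
    """Precision, recall and error rates with 'spam' as the positive class.

    tp: spam hosts classified as spam      fp: normal hosts classified as spam
    tn: normal hosts classified as normal  fn: spam hosts classified as normal
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }
```

For example, with 50 spam hosts caught, 50 spam hosts missed, 10 normal hosts flagged and 440 normal hosts passed, recall and false negative rate are both 0.5. They always sum to one because they share the denominator "# of spam hosts".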
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                        Spam class        False   False
(pruning with M = 5)               Prec.   Recall    Pos.    Neg.
TrustRank                          0.82    0.50      2.1%    50%
Trunc. PageRank t = 2              0.85    0.50      1.6%    50%
Trunc. PageRank t = 3              0.84    0.47      1.6%    53%
Trunc. PageRank t = 4              0.79    0.45      2.2%    55%
Est. Supporters d = 2              0.78    0.60      3.2%    40%
Est. Supporters d = 3              0.83    0.64      2.4%    36%
Est. Supporters d = 4              0.86    0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers                        Spam class        False   False
(pruning with M = 30)              Prec.   Recall    Pos.    Neg.
TrustRank                          0.80    0.49      2.3%    51%
Trunc. PageRank t = 2              0.82    0.43      1.8%    57%
Trunc. PageRank t = 3              0.81    0.42      1.8%    58%
Trunc. PageRank t = 4              0.77    0.43      2.4%    57%
Est. Supporters d = 2              0.76    0.52      3.1%    48%
Est. Supporters d = 3              0.83    0.57      2.1%    43%
Est. Supporters d = 4              0.80    0.57      2.6%    43%
Combined classifier
                       Spam class            False   False
Pruning    Rules       Precision   Recall    Pos.    Neg.
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, Supporters at 2, ..., 4.
We measured:

$$\mathrm{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$

$$\mathrm{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$

and the two types of errors in spam classification:

$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$

$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
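The four measures can be computed from paired labels as in the sketch below. The function name `spam_metrics` and the list-of-booleans input format are assumptions made for illustration only.

```python
def spam_metrics(is_spam, classified_spam):
    """Compute the four measures from parallel lists of booleans (one entry per host)."""
    pairs = list(zip(is_spam, classified_spam))
    tp = sum(1 for actual, pred in pairs if actual and pred)          # spam classified as spam
    fp = sum(1 for actual, pred in pairs if not actual and pred)      # normal classified as spam
    fn = sum(1 for actual, pred in pairs if actual and not pred)      # spam classified as normal
    tn = sum(1 for actual, pred in pairs if not actual and not pred)  # normal classified as normal

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, tp + fn),
    }


# Five hosts: two spam, three normal.
print(spam_metrics([True, True, False, False, False],
                   [True, False, True, False, False]))
```

Recall and false negative rate share the same denominator, so they always add up to one.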
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M = 5)
Classifiers               Spam class          False    False
(pruning with M = 5)      Prec.    Recall     Pos.     Neg.
TrustRank                 0.82     0.50       2.1%     50%
Trunc. PageRank t = 2     0.85     0.50       1.6%     50%
Trunc. PageRank t = 3     0.84     0.47       1.6%     53%
Trunc. PageRank t = 4     0.79     0.45       2.2%     55%
Est. Supporters d = 2     0.78     0.60       3.2%     40%
Est. Supporters d = 3     0.83     0.64       2.4%     36%
Est. Supporters d = 4     0.86     0.64       2.0%     36%
Comparison of single-technique classifiers (M = 30)
Classifiers               Spam class          False    False
(pruning with M = 30)     Prec.    Recall     Pos.     Neg.
TrustRank                 0.80     0.49       2.3%     51%
Trunc. PageRank t = 2     0.82     0.43       1.8%     57%
Trunc. PageRank t = 3     0.81     0.42       1.8%     58%
Trunc. PageRank t = 4     0.77     0.43       2.4%     57%
Est. Supporters d = 2     0.76     0.52       3.1%     48%
Est. Supporters d = 3     0.83     0.57       2.1%     43%
Est. Supporters d = 4     0.80     0.57       2.6%     43%
Combined classifier
Pruning    Rules    Spam class Precision    Spam class Recall    False Pos.    False Neg.
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001). The classification of search engine spam. Available online at http://www.silverdisc.co.uk/articles/spam-classification
General algorithm
Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
for i : 1 ... N do {Initialization}
    R[i] ← (1 − α)/((α^(T+1)) N)
    if T ≥ 0 then
        Score[i] ← 0
    else {Calculate normal PageRank}
        Score[i] ← R[i]
    end if
end for
distance = 1
while not converged do
    Aux ← 0
    for src : 1 ... N do {Follow links in the graph}
        for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
        end for
    end for
    for i : 1 ... N do {Apply damping factor α}
        R[i] ← Aux[i] × α
        if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
        end if
    end for
    distance = distance + 1
end while
return Score
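The Truncated PageRank pseudocode translates fairly directly into Python. The sketch below rests on a few assumptions that are not on the slide: nodes are numbered from 0, the graph is given as a list of `(src, dest)` pairs, and "converged" is taken to mean that the propagated mass falls below a tolerance `tol` (with `max_iter` as a safety bound).

```python
def truncated_pagerank(n, links, alpha=0.85, T=2, tol=1e-12, max_iter=1000):
    """Truncated PageRank: contributions from paths of length <= T are ignored.

    n     -- number of nodes, numbered 0 .. n-1 (assumption of this sketch)
    links -- list of (src, dest) pairs
    T     -- truncation distance; T = -1 yields normal PageRank
    """
    outdegree = [0] * n
    for src, _ in links:
        outdegree[src] += 1

    # Initialization: R[i] = (1 - alpha) / (alpha^(T+1) * N)
    r = [(1 - alpha) / (alpha ** (T + 1) * n)] * n
    score = [0.0] * n if T >= 0 else list(r)

    for distance in range(1, max_iter + 1):
        aux = [0.0] * n
        for src, dest in links:              # follow links in the graph
            aux[dest] += r[src] / outdegree[src]
        r = [alpha * a for a in aux]         # apply damping factor alpha
        if distance > T:                     # add to ranking value
            score = [s + x for s, x in zip(score, r)]
            if sum(r) < tol:                 # assumed convergence criterion
                break
    return score


# On a directed 4-cycle every node ends up with a score close to 1/4,
# both for normal PageRank (T = -1) and for the truncated version (T = 2).
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(truncated_pagerank(4, cycle, T=-1))
print(truncated_pagerank(4, cycle, T=2))
```

Nodes without out-links simply stop propagating their mass in this sketch, which matches the pseudocode as written.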
Truncated PageRank vs PageRank
[Figure: two log-log scatter plots of Truncated PageRank (T = 1 and T = 4, from 10^−9 to 10^−3) against normal PageRank (from 10^−8 to 10^−4)]
Comparing PageRank and Truncated PageRank with T = 1 and T = 4.
The correlation is high and decreases as more levels are truncated.
Probabilistic counting
[Figure: each page holds a bit vector (e.g. 100010, 110000, 000110, 000011); bits are propagated along links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al. 2002], based on probabilistic counting [Flajolet and Martin 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
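The bit-propagation loop can be sketched in Python by storing each node's k bits in one integer. Several choices below are assumptions of this example rather than details from the slides: nodes are numbered from 0, links are `(src, dest)` pairs, INIT sets each bit with probability `eps`, and ESTIMATE is passed in as a function (by default it only counts the bits set, which is a placeholder for a real estimator).

```python
import random


def init_bits(n, k, eps, rng):
    """INIT: set each of the k bits of every node to one with probability eps."""
    return [sum(1 << bit for bit in range(k) if rng.random() < eps) for _ in range(n)]


def propagate(v, links, d):
    """Run the iteration step d times: Aux[dest] |= V[src] over all links, then V <- Aux."""
    for _ in range(d):
        aux = [0] * len(v)
        for src, dest in links:      # follow links in the graph
            aux[dest] |= v[src]
        v = aux
    return v


def count_ones(bits):
    """Placeholder ESTIMATE: the number of bits set to one."""
    return bin(bits).count("1")


def supporters(n, links, d, k, eps, estimate=count_ones, seed=0):
    """Propagate random bit vectors for d steps and apply ESTIMATE to every node."""
    rng = random.Random(seed)
    v = propagate(init_bits(n, k, eps, rng), links, d)
    return [estimate(bits) for bits in v]


# Hand-picked bit vectors make the OR propagation easy to follow:
# node 2 is linked from nodes 1 and 3, so after one step it holds 0b0010 | 0b1000.
links = [(0, 1), (1, 2), (3, 2)]
print([bin(x) for x in propagate([0b0001, 0b0010, 0b0100, 0b1000], links, 1)])
```

Each pass only ORs machine words along the links, so the cost per iteration is proportional to the number of links regardless of how many supporters a node has.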
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Adaptive estimator
If we knew neighbors(node) and chose $\varepsilon = 1/\text{neighbors}(\text{node})$, we would get
$\text{ones}(\text{node}) \approx \left(1 - \frac{1}{e}\right) k \approx 0.63\,k$
Adaptive estimation
Repeat the above process for $\varepsilon = 1/2, 1/4, 1/8, \ldots$ and look for the transition from more than $(1 - 1/e)\,k$ ones to fewer than $(1 - 1/e)\,k$ ones.
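The adaptive procedure can be sketched as a loop over halving values of ε. The slide does not say how an estimate is read off the transition, so returning 1/ε at the first ε that falls below the threshold is an assumption, as are the names `adaptive_estimate` and `expected_ones`. The latter replaces the random bit vectors with their expected value so that the example is deterministic.

```python
import math
from typing import Callable, Optional


def adaptive_estimate(count_ones: Callable[[float], int], k: int,
                      max_steps: int = 32) -> Optional[float]:
    """Try eps = 1/2, 1/4, 1/8, ... and stop at the first eps for which
    fewer than (1 - 1/e) * k bits are set; return 1/eps as the estimate."""
    threshold = (1.0 - 1.0 / math.e) * k
    for j in range(1, max_steps + 1):
        eps = 2.0 ** -j
        if count_ones(eps) < threshold:
            return 1.0 / eps  # assumed read-out of the transition point
    return None  # no transition observed within max_steps halvings


def expected_ones(n: int, k: int) -> Callable[[float], int]:
    """Hypothetical stand-in for a real propagation run: the expected
    number of ones for a node with n neighbors, rounded to an integer."""
    return lambda eps: round(k * (1.0 - (1.0 - eps) ** n))
```

Under these assumptions the estimate is only accurate to within a factor of about two, since ε is halved at every step. For a node with 100 neighbors and k = 64 the sketch returns 128.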
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) against iteration (x-axis), with one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less.
Fewer than 25 iterations for all distances up to 8.
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2...4, Supporters at 2...4.
We measured:
$\text{Precision} = \frac{\text{number of spam hosts classified as spam}}{\text{number of hosts classified as spam}}$
$\text{Recall} = \frac{\text{number of spam hosts classified as spam}}{\text{number of spam hosts}}$
and the two types of errors in spam classification:
$\text{False positive rate} = \frac{\text{number of normal hosts classified as spam}}{\text{number of normal hosts}}$
$\text{False negative rate} = \frac{\text{number of spam hosts classified as normal}}{\text{number of spam hosts}}$
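The four measures follow directly from the counts in a confusion matrix. A short sketch is given here; the function name `spam_metrics` and the example counts in the test are hypothetical and are not taken from the experiments.

```python
from typing import Dict


def spam_metrics(spam_as_spam: int, normal_as_spam: int,
                 normal_as_normal: int, spam_as_normal: int) -> Dict[str, float]:
    """Compute the four measures defined on the slide from host counts."""
    classified_spam = spam_as_spam + normal_as_spam
    spam_hosts = spam_as_spam + spam_as_normal
    normal_hosts = normal_as_spam + normal_as_normal
    if min(classified_spam, spam_hosts, normal_hosts) == 0:
        raise ValueError("each denominator must be non-zero")
    return {
        "precision": spam_as_spam / classified_spam,
        "recall": spam_as_spam / spam_hosts,
        "false_positive_rate": normal_as_spam / normal_hosts,
        "false_negative_rate": spam_as_normal / spam_hosts,
    }
```

Recall and the false negative rate share the same denominator, so they always sum to one. The false positive rate is taken over normal hosts, so it can remain small even when precision is moderate if normal hosts greatly outnumber spam hosts.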
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M = 5)
Classifiers (pruning with M = 5)   Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.82               0.50                2.1%         50%
Trunc. PageRank t = 2              0.85               0.50                1.6%         50%
Trunc. PageRank t = 3              0.84               0.47                1.6%         53%
Trunc. PageRank t = 4              0.79               0.45                2.2%         55%
Est. Supporters d = 2              0.78               0.60                3.2%         40%
Est. Supporters d = 3              0.83               0.64                2.4%         36%
Est. Supporters d = 4              0.86               0.64                2.0%         36%
Comparison of single-technique classifiers (M = 30)
Classifiers (pruning with M = 30)  Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.80               0.49                2.3%         51%
Trunc. PageRank t = 2              0.82               0.43                1.8%         57%
Trunc. PageRank t = 3              0.81               0.42                1.8%         58%
Trunc. PageRank t = 4              0.77               0.43                2.4%         57%
Est. Supporters d = 2              0.76               0.52                3.1%         48%
Est. Supporters d = 3              0.83               0.57                2.1%         43%
Est. Supporters d = 4              0.80               0.57                2.6%         43%
Combined classifier
Pruning   Rules   Spam class Precision   Spam class Recall   False Pos.   False Neg.
Probabilistic counting
[Figure: each page holds a bit vector (for example 100010, 110000, 000110, 000011); bits are propagated along links towards the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters]
Improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do    {Iteration step}
    Aux ← 0_k
    for src : 1 ... N do    {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do    {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
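The pseudocode can be sketched in Python by storing each node's k bits in one integer. This is an illustration under stated assumptions rather than the authors' code: the names `init_bits`, `propagate` and `count_ones` are hypothetical, nodes are numbered from 0, and the update follows the slide literally (V is replaced by Aux after every step).

```python
import random
from typing import List, Sequence, Tuple


def init_bits(n: int, k: int, eps: float, seed: int = 0) -> List[int]:
    """INIT: set each of the k bits of every node to one with probability eps."""
    rng = random.Random(seed)
    return [sum(1 << b for b in range(k) if rng.random() < eps)
            for _ in range(n)]


def propagate(links: Sequence[Tuple[int, int]], v: Sequence[int], d: int) -> List[int]:
    """Run d iteration steps: every link (src, dest) ORs src's bits into dest."""
    v = list(v)
    for _ in range(d):
        aux = [0] * len(v)            # Aux <- 0_k for every node
        for src, dest in links:       # follow links in the graph
            aux[dest] |= v[src]
        v = aux                       # V <- Aux, as written on the slide
    return v


def count_ones(v: Sequence[int]) -> List[int]:
    """Number of bits set per node, the input to the ESTIMATE step."""
    return [bin(bits).count("1") for bits in v]
```

For a toy graph in which nodes 0 and 1 both link to node 2, one step leaves node 2 holding the OR of the two source vectors. Each step scans the link list once, so the cost of d steps grows with d times the number of links, and the per-node state is a single k-bit word.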
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with trustrank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Motivation
Spam pages characterization
Truncated PageRank
Counting supporters
Experiments
Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Convergence
[Figure: fraction of nodes with estimates (0 to 100) versus iteration (5 to 20), one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2 ... 4, Supporters at 2 ... 4
We measured:
$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$
$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$
and the two types of errors in spam classification:
$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$
$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
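The four measures can be computed from host counts as in the following Python sketch. The `Confusion` class and its field names are illustrative assumptions, not from the slides, and the sketch assumes that every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Host counts; the field names are illustrative, not from the slides."""
    spam_as_spam: int      # spam hosts classified as spam
    spam_as_normal: int    # spam hosts classified as normal
    normal_as_spam: int    # normal hosts classified as spam
    normal_as_normal: int  # normal hosts classified as normal


def metrics(c: Confusion) -> dict:
    """Return precision, recall and the two error rates defined above."""
    spam_hosts = c.spam_as_spam + c.spam_as_normal
    normal_hosts = c.normal_as_spam + c.normal_as_normal
    flagged = c.spam_as_spam + c.normal_as_spam  # hosts classified as spam
    return {
        "precision": c.spam_as_spam / flagged,
        "recall": c.spam_as_spam / spam_hosts,
        "false_positive_rate": c.normal_as_spam / normal_hosts,
        "false_negative_rate": c.spam_as_normal / spam_hosts,
    }
```

Recall and the false negative rate share the denominator (the number of spam hosts), so they always sum to 1.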
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                 Spam class         False   False
(pruning with M = 5)        Prec.    Recall    Pos.    Neg.
TrustRank                   0.82     0.50      2.1%    50%
Trunc. PageRank t = 2       0.85     0.50      1.6%    50%
Trunc. PageRank t = 3       0.84     0.47      1.6%    53%
Trunc. PageRank t = 4       0.79     0.45      2.2%    55%
Est. Supporters d = 2       0.78     0.60      3.2%    40%
Est. Supporters d = 3       0.83     0.64      2.4%    36%
Est. Supporters d = 4       0.86     0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers                 Spam class         False   False
(pruning with M = 30)       Prec.    Recall    Pos.    Neg.
TrustRank                   0.80     0.49      2.3%    51%
Trunc. PageRank t = 2       0.82     0.43      1.8%    57%
Trunc. PageRank t = 3       0.81     0.42      1.8%    58%
Trunc. PageRank t = 4       0.77     0.43      2.4%    57%
Est. Supporters d = 2       0.76     0.52      3.1%    48%
Est. Supporters d = 3       0.83     0.57      2.1%    43%
Est. Supporters d = 4       0.80     0.57      2.6%    43%
Combined classifier
[Table columns: Pruning, Rules, Spam class Precision, Spam class Recall, False Pos., False Neg.]
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1-6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721-732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576-587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83-92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81-90, New York, NY, USA. ACM Press.
Perkins, A. (2001). The classification of search engine spam. Available online at http://www.silverdisc.co.uk/articles/spam-classification
Probabilistic counting
[Figure: bit vectors (e.g. 100010, 110000, 000110, 000011) are propagated along links to the target page using the "OR" operation; the bits set at the target page are counted to estimate supporters]
Improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 ... N, bit : 1 ... k do
    INIT(node, bit)
end for
for distance : 1 ... d do   {Iteration step}
    Aux ← 0^k
    for src : 1 ... N do   {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 ... N do   {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
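The general algorithm can be sketched in Python as below. This is a minimal illustration under several assumptions that are not in the slides: nodes are numbered from 0, each node's k bits are packed into one Python integer, INIT sets each bit to one with probability `eps`, and ESTIMATE is passed in as a callable. The update follows the slide literally (V ← Aux).

```python
import random
from typing import Callable, List, Sequence, Tuple


def count_supporters(
    n: int,
    edges: Sequence[Tuple[int, int]],
    d: int,
    k: int,
    eps: float,
    estimate: Callable[[int], float],
    rng: random.Random,
) -> List[float]:
    """Propagate k-bit masks along links for d steps, then estimate per node."""
    # INIT: each of the k bits of every node is set to one with probability eps.
    v = [sum((rng.random() < eps) << b for b in range(k)) for _ in range(n)]
    for _ in range(d):  # iteration step
        aux = [0] * n
        for src, dest in edges:  # follow links in the graph
            aux[dest] |= v[src]
        v = aux  # as on the slide: V <- Aux
    # ESTIMATE receives the bit mask of each node.
    return [estimate(mask) for mask in v]
```

For example, passing `lambda mask: bin(mask).count("1")` as `estimate` returns the raw number of ones per node, which can then be fed to an estimator of the supporter count.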
Motivation
Spam pages characterization
Truncated PageRank
Counting supporters
Experiments
Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Our estimator

Initialize every bit to one, independently, with probability ε.
By the independence of the i-th components X_i, we have

$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(\mathrm{node})}$

Estimator: $\mathrm{neighbors}(\mathrm{node}) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(\mathrm{node})}{k}\right)$

Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
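A minimal sketch of this estimator, assuming the function names `estimate_neighbors` and `expected_ones` (both hypothetical, introduced only for illustration). It inverts P[X_i = 1] using the observed fraction ones/k and makes the saturation problem visible: once all k bits are one, no finite estimate can be recovered.

```python
import math


def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)**n using the observed fraction ones / k."""
    if ones >= k:
        # Saturated: every bit is one, so the logarithm is undefined.
        return math.inf
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)


def expected_ones(n, k, eps):
    """Expected number of ones among k bits for a node with n neighbors."""
    return k * (1.0 - (1.0 - eps) ** n)
```

For a fixed ε, a node with few neighbors is expected to show almost no ones, while a node with very many neighbors is expected to show nearly k ones; in both regimes the estimate carries little information, which is the problem stated above.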
Adaptive estimator

If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get

$\mathrm{ones}(\mathrm{node}) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$

Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, ... and look for the transition from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones.
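The search for the transition can be sketched as below. The slide only says to look for the transition; returning 1/ε at the first ε that falls below the threshold is an assumption of this sketch, as are the names `adaptive_estimate`, `ones_for` (a hypothetical callback returning the number of ones observed at the node when the process is run with a given ε) and `max_halvings`.

```python
import math


def adaptive_estimate(ones_for, k, max_halvings=32):
    """Halve eps until the number of ones drops below (1 - 1/e) * k.

    ones_for -- hypothetical callback: ones_for(eps) -> observed number of ones
    Returns 1 / eps at the transition (an assumption of this sketch),
    or None if no transition is seen within max_halvings steps.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    eps = 0.5
    for _ in range(max_halvings):
        if ones_for(eps) < threshold:
            return 1.0 / eps
        eps /= 2.0
    return None
```

Because ε only takes the values 1/2, 1/4, 1/8, ..., an estimate obtained this way is a power of two, so it can only locate the neighborhood size to within a factor of two.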
Convergence

[Figure: fraction of nodes with estimates (y-axis, 0 to 100) versus iteration (x-axis, 5 to 20), one curve for each distance d = 1, ..., 8]

15 iterations for estimating the neighbors at distance 4 or less
Fewer than 25 iterations for all distances up to 8
Automatic classifier

We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, and Supporters at 2, ..., 4.
We measured:

$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$

$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$

and the two types of errors in spam classification:

$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$

$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
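The four measures can be computed from the counts of a confusion matrix, as in the short sketch below. The function name `spam_metrics` and its argument names are hypothetical, and any counts passed to it here are illustrative, not results from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures defined above from confusion-matrix counts.

    tp -- spam hosts classified as spam
    fp -- normal hosts classified as spam
    tn -- normal hosts classified as normal
    fn -- spam hosts classified as normal
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }
```

Recall and the false negative rate share the same denominator (the number of spam hosts), so they always sum to one.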
Single-technique classifier

Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifiers (M=5)

Classifiers               Spam class          False    False
(pruning with M = 5)      Prec.    Recall     Pos.     Neg.
TrustRank                 0.82     0.50       2.1%     50%
Trunc. PageRank t = 2     0.85     0.50       1.6%     50%
Trunc. PageRank t = 3     0.84     0.47       1.6%     53%
Trunc. PageRank t = 4     0.79     0.45       2.2%     55%
Est. Supporters d = 2     0.78     0.60       3.2%     40%
Est. Supporters d = 3     0.83     0.64       2.4%     36%
Est. Supporters d = 4     0.86     0.64       2.0%     36%
Comparison of single-technique classifiers (M=30)

Classifiers               Spam class          False    False
(pruning with M = 30)     Prec.    Recall     Pos.     Neg.
TrustRank                 0.80     0.49       2.3%     51%
Trunc. PageRank t = 2     0.82     0.43       1.8%     57%
Trunc. PageRank t = 3     0.81     0.42       1.8%     58%
Trunc. PageRank t = 4     0.77     0.43       2.4%     57%
Est. Supporters d = 2     0.76     0.52       3.1%     48%
Est. Supporters d = 3     0.83     0.57       2.1%     43%
Est. Supporters d = 4     0.80     0.57       2.6%     43%
Combined classifier

                      Spam class              False    False
Pruning    Rules      Precision    Recall     Pos.     Neg.
Conclusions

✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification/
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 … N, bit : 1 … k do
    INIT(node, bit)
end for
for distance : 1 … d do    {Iteration step}
    Aux ← 0_k
    for src : 1 … N do    {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node : 1 … N do    {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters
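The initialization and iteration steps of the general algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `init_bits` and `propagate`, the use of Python integers as k-bit vectors, and the edge-list input format are all choices made for this example. Note that, as transcribed on the slide, V is overwritten by Aux at every step, so after d steps a node's vector combines the bits of the nodes that reach it by walks of exactly d links; OR-ing Aux into V instead would accumulate all distances up to d.

```python
import random


def init_bits(num_nodes, k, eps, rng):
    """INIT step: one k-bit vector per node, each bit set to 1 with probability eps.

    Vectors are stored as Python ints used as bitmasks (an assumption of this sketch).
    """
    vectors = []
    for _ in range(num_nodes):
        bits = 0
        for bit in range(k):
            if rng.random() < eps:
                bits |= 1 << bit
        vectors.append(bits)
    return vectors


def propagate(links, vectors, distance):
    """Iteration step, repeated `distance` times.

    links   -- list of (src, dest) pairs, nodes numbered from 0
    vectors -- list of int bitmasks, one per node
    """
    v = list(vectors)
    for _ in range(distance):
        aux = [0] * len(v)          # Aux <- 0_k for every node
        for src, dest in links:     # follow links in the graph
            aux[dest] |= v[src]     # Aux[dest] <- Aux[dest] OR V[src, .]
        v = aux                     # V <- Aux, as written on the slide
    return v


def count_ones(vectors):
    """Number of bits set in each node's vector, the input to ESTIMATE."""
    return [bin(x).count("1") for x in vectors]
```

A caller would run `init_bits`, then `propagate`, and pass the result of `count_ones` to whichever ESTIMATE function is in use.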
Our estimator
Initialize all bits to one with probability ε.
By the independence of the $i$-th components $X_i$, we have
$$P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$$
Estimator:
$$\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies. This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
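The estimator above can be written as a short Python function. This is a sketch for illustration only: the names `expected_ones` and `estimate_neighbors` are hypothetical, and returning infinity when every bit is set is an assumption made here to expose the saturation problem described on the slide.

```python
import math


def expected_ones(neighbors, k, eps):
    """Expected number of bits set: k * P[X_i = 1] = k * (1 - (1 - eps)**neighbors)."""
    return k * (1.0 - (1.0 - eps) ** neighbors)


def estimate_neighbors(ones, k, eps):
    """Invert the formula above: log base (1 - eps) of (1 - ones / k)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be strictly between 0 and 1")
    if ones <= 0:
        return 0.0          # no bit set: the estimate collapses to zero
    if ones >= k:
        return math.inf     # all bits set: the estimate is unbounded (saturation)
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)
```

With ε = 1/2 and k = 4, two neighbors give three expected ones, and three observed ones map back to an estimate of two. The two guarded branches correspond to the cases in which ones(node) equals 0 or k and the estimate carries no usable information.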
Adaptive estimator
If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get
$$\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63\,k$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, …, and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
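The adaptive search can be sketched as a loop over halved values of ε. The slide does not say which value is reported once the transition is found; returning 1/ε at the first ε whose count falls below (1 − 1/e)k is an assumption of this sketch, as are the function name `adaptive_estimate`, the callback `ones_for_eps` (which stands for one full run of the bit propagation at a given ε), and the cap `max_halvings`.

```python
import math


def adaptive_estimate(ones_for_eps, k, max_halvings=32):
    """Try eps = 1/2, 1/4, 1/8, ... and stop at the first transition.

    ones_for_eps -- callable returning the number of ones observed for a node
                    when the bits are initialized with probability eps
    k            -- number of bits per node
    Returns 1/eps at the transition, or None if no transition is seen.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    eps = 0.5
    for _ in range(max_halvings):
        if ones_for_eps(eps) < threshold:
            return 1.0 / eps    # coarse, power-of-two estimate of neighbors(node)
        eps /= 2.0
    return None
```

Feeding the expected count k(1 − (1 − ε)^n) for n = 100 and k = 64 into this loop, the count first drops below the threshold at ε = 1/128, so the sketch reports 128, within a factor of two of the true value.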
Convergence
[Figure: fraction of nodes with estimates (0 to 100) versus iteration (5 to 20), one curve per distance d = 1, …, 8]
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, …, 4, Supporters at 2, …, 4.
We measured:
$$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$$
$$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$$
and the two types of errors in spam classification:
$$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$$
$$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$$
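The four measures can be computed from the counts of a confusion matrix, as in the sketch below. The function name `classification_rates` and the argument names are hypothetical, and the counts in the usage note are made up for illustration; they are not results from the experiments.

```python
def classification_rates(spam_as_spam, normal_as_spam, normal_as_normal, spam_as_normal):
    """Return precision, recall, false positive rate and false negative rate.

    Each argument is a count of hosts: the true class comes first in the name,
    the predicted class second.
    """
    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    spam_hosts = spam_as_spam + spam_as_normal
    normal_hosts = normal_as_spam + normal_as_normal
    return {
        "precision": ratio(spam_as_spam, spam_as_spam + normal_as_spam),
        "recall": ratio(spam_as_spam, spam_hosts),
        "false_positive_rate": ratio(normal_as_spam, normal_hosts),
        "false_negative_rate": ratio(spam_as_normal, spam_hosts),
    }
```

For example, with 50 spam hosts caught, 50 missed, 10 normal hosts flagged and 890 normal hosts passed, precision is 50/60, recall is 0.5, the false positive rate is 10/900, and the false negative rate is 0.5. Recall and the false negative rate share a denominator and always sum to one.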
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
Classifiers                 Spam class         False   False
(pruning with M = 5)        Prec.    Recall    Pos.    Neg.
TrustRank                   0.82     0.50      2.1%    50%
Trunc. PageRank t = 2       0.85     0.50      1.6%    50%
Trunc. PageRank t = 3       0.84     0.47      1.6%    53%
Trunc. PageRank t = 4       0.79     0.45      2.2%    55%
Est. Supporters d = 2       0.78     0.60      3.2%    40%
Est. Supporters d = 3       0.83     0.64      2.4%    36%
Est. Supporters d = 4       0.86     0.64      2.0%    36%
Comparison of single-technique classifier (M=30)
Classifiers                 Spam class         False   False
(pruning with M = 30)       Prec.    Recall    Pos.    Neg.
TrustRank                   0.80     0.49      2.3%    51%
Trunc. PageRank t = 2       0.82     0.43      1.8%    57%
Trunc. PageRank t = 3       0.81     0.42      1.8%    58%
Trunc. PageRank t = 4       0.77     0.43      2.4%    57%
Est. Supporters d = 2       0.76     0.52      3.1%    48%
Est. Supporters d = 3       0.83     0.57      2.1%    43%
Est. Supporters d = 4       0.80     0.57      2.6%    43%
Combined classifier
                     Spam class             False   False
Pruning    Rules     Precision    Recall    Pos.    Neg.
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Our estimator
Initialize all bits to one with probability ε. By the independence of the i-th components $X_i$, we have
$P[X_i = 1] = 1 - (1 - \varepsilon)^{\mathrm{neighbors}(node)}$
Estimator: $\mathrm{neighbors}(node) = \log_{(1-\varepsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
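The estimator on this slide can be sketched in a few lines, assuming that each of the n neighbors contributes a k-bit vector whose bits are one with probability ε and that the node only sees the bitwise OR of those vectors. The function names `simulate_ones` and `estimate_neighbors`, and the simulation itself, are illustrative assumptions rather than the authors' implementation; in the real setting the OR is obtained by propagating bit vectors along links and n is unknown.

```python
import math
import random


def simulate_ones(num_neighbors, k, eps, rng):
    """Count the ones in the OR of num_neighbors random k-bit vectors.

    Hypothetical simulation: every neighbor sets each bit independently
    with probability eps, so bit i of the OR is 1 if any neighbor set it.
    """
    ones = 0
    for _ in range(k):
        if any(rng.random() < eps for _ in range(num_neighbors)):
            ones += 1
    return ones


def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)^n using the observed fraction ones / k."""
    if ones == 0:
        return 0.0        # no bit set: eps is too small for this neighborhood
    if ones >= k:
        return math.inf   # all bits set: eps is too large, the estimate saturates
    # log base (1 - eps) of (1 - ones / k)
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)


if __name__ == "__main__":
    rng = random.Random(7)
    observed = simulate_ones(100, 4096, 0.01, rng)
    print(estimate_neighbors(observed, 4096, 0.01))
```

The two early returns correspond to the problem stated on the slide: when ones(node) is 0 or k, the observed fraction carries almost no information about the neighborhood size.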
Adaptive estimator
If we knew neighbors(node) and chose $\varepsilon = 1/\mathrm{neighbors}(node)$, we would get
$\mathrm{ones}(node) \approx \left(1 - \frac{1}{e}\right) k \approx 0.63\,k$
Adaptive estimation
Repeat the above process for $\varepsilon = 1/2, 1/4, 1/8, \ldots$ and look for the transition from more than $(1 - 1/e)\,k$ ones to fewer than $(1 - 1/e)\,k$ ones.
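The adaptive procedure can be illustrated with a small simulation. This is a sketch under stated assumptions: `count_ones` draws ones(node) directly from the per-bit probability 1 − (1 − ε)^n instead of propagating bit vectors over a graph, and reporting 1/ε at the first transition is a simplification chosen here (the slide only says to look for the transition). All names and the `max_rounds` cap are hypothetical.

```python
import math
import random


def count_ones(num_neighbors, k, eps, rng):
    """Simulate ones(node): each of the k OR-ed bits is 1 with
    probability 1 - (1 - eps)^num_neighbors."""
    p_one = 1.0 - (1.0 - eps) ** num_neighbors
    return sum(1 for _ in range(k) if rng.random() < p_one)


def adaptive_estimate(num_neighbors, k, rng, max_rounds=40):
    """Halve eps until ones(node) drops below (1 - 1/e) * k.

    Returns 1 / eps at the first such round, which is a coarse
    (power-of-two) estimate of the number of neighbors.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    eps = 0.5
    for _ in range(max_rounds):
        if count_ones(num_neighbors, k, eps, rng) < threshold:
            return 1.0 / eps
        eps /= 2.0
    # No transition found: the neighborhood is larger than every scale tried.
    return math.inf


if __name__ == "__main__":
    print(adaptive_estimate(100, 1024, random.Random(0)))
```

Because ε is halved at each round, the result is only accurate to within a small constant factor; a finer estimate would need to combine the counts observed around the transition.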
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0 to 100) versus iteration (x-axis, 5 to 20), one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
Less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, Supporters at 2, ..., 4
We measured:
$\text{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}$
$\text{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}}$
and the two types of errors in spam classification:
$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}$
$\text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}$
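The four measures above follow directly from the counts of a confusion matrix. The sketch below assumes the counts are already available; the function name `spam_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the four measures from host counts.

    tp: spam hosts classified as spam
    fp: normal hosts classified as spam
    tn: normal hosts classified as normal
    fn: spam hosts classified as normal
    """
    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, tp + fn),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(spam_metrics(tp=80, fp=20, tn=880, fn=20))
```

Recall and the false negative rate share the denominator (the number of spam hosts), so they always sum to one.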
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
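All three single-technique classifiers share the same feature pattern: PageRank, one technique-specific score, and that score divided by PageRank. A minimal sketch of that pattern follows, assuming one score per host; the function name `single_technique_features` and the example values are hypothetical.

```python
def single_technique_features(pagerank, score):
    """Return [PageRank, score, score / PageRank] for one host.

    `score` stands for the technique-specific value: estimated non-spam
    mass, Truncated PageRank at distance t, or estimated supporters at
    distance d.
    """
    if pagerank <= 0:
        raise ValueError("PageRank must be positive")
    return [pagerank, score, score / pagerank]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(single_technique_features(0.002, 0.001))
```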
Comparison of single-technique classifiers (M=5)
Classifiers (pruning with M = 5)   Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.82               0.50                2.1%         50%
Trunc. PageRank t = 2              0.85               0.50                1.6%         50%
Trunc. PageRank t = 3              0.84               0.47                1.6%         53%
Trunc. PageRank t = 4              0.79               0.45                2.2%         55%
Est. Supporters d = 2              0.78               0.60                3.2%         40%
Est. Supporters d = 3              0.83               0.64                2.4%         36%
Est. Supporters d = 4              0.86               0.64                2.0%         36%
Comparison of single-technique classifiers (M=30)
Classifiers (pruning with M = 30)  Spam class Prec.   Spam class Recall   False Pos.   False Neg.
TrustRank                          0.80               0.49                2.3%         51%
Trunc. PageRank t = 2              0.82               0.43                1.8%         57%
Trunc. PageRank t = 3              0.81               0.42                1.8%         58%
Trunc. PageRank t = 4              0.77               0.43                2.4%         57%
Est. Supporters d = 2              0.76               0.52                3.1%         48%
Est. Supporters d = 3              0.83               0.57                2.1%         43%
Est. Supporters d = 4              0.80               0.57                2.6%         43%
Combined classifier
Pruning   Rules   Spam class Precision   Spam class Recall   False Pos.   False Neg.
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Our estimator
Initialize each bit to one with probability ε. By the independence of the components X_i, for the i-th component we have
$$P[X_i = 1] = 1 - (1 - \varepsilon)^{\mathrm{neighbors}(node)}$$
Estimator:
$$\mathrm{neighbors}(node) = \log_{(1-\varepsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$$
Problem: neighbors(node) can vary by orders of magnitude as node varies.
This means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability.
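The estimator can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names `or_of_random_masks` and `estimate_neighbors` are hypothetical, the bitwise OR over a fixed number of neighbors stands in for the propagation over the graph, and returning infinity for a saturated vector is a choice made here to expose the problem the slide describes, not something taken from the slides.

```python
import math
import random


def or_of_random_masks(num_neighbors, k, eps, rng):
    """Simulate the k-bit vector seen at a node: the bitwise OR of one
    random mask per neighbor, each bit set independently with prob. eps."""
    bits = [0] * k
    for _ in range(num_neighbors):
        for i in range(k):
            if rng.random() < eps:
                bits[i] = 1
    return bits


def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)^n using the observed fraction ones/k."""
    if ones >= k:
        # Saturated vector: every bit is one, so no finite estimate exists.
        return math.inf
    # log base (1 - eps), written with natural logarithms.
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)
```

With ε far too large for the true neighborhood size, the vector saturates and the estimate is unusable; with ε far too small, almost no bit is set and the estimate is dominated by noise. This is the motivation for trying several values of ε.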
Adaptive estimator
If we knew neighbors(node) and chose $\varepsilon = 1/\mathrm{neighbors}(node)$, we would get
$$\mathrm{ones}(node) \approx \left(1 - \frac{1}{e}\right) k \approx 0.63\,k$$
Adaptive estimation
Repeat the above process for $\varepsilon = 1/2, 1/4, 1/8, \ldots$ and look for the transition from more than $(1 - 1/e)k$ ones to less than $(1 - 1/e)k$ ones.
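The search for the transition can be sketched as follows. The slide does not say how the transition point is turned into a number, so reporting 1/ε at the first ε whose count falls below the threshold is an assumption of this sketch, and the names `expected_ones` and `adaptive_estimate` are hypothetical. The helper uses the expected count rather than a random draw so that the example stays deterministic.

```python
import math


def expected_ones(num_neighbors, k, eps):
    """Expected number of one-bits out of k: k * (1 - (1 - eps)^n)."""
    return k * (1.0 - (1.0 - eps) ** num_neighbors)


def adaptive_estimate(ones_by_eps, k):
    """Scan (eps, ones) pairs in order of decreasing eps and return 1/eps
    at the first pair whose count drops below (1 - 1/e) * k.

    Returns None if the count never drops below the threshold.
    """
    threshold = (1.0 - 1.0 / math.e) * k
    for eps, ones in ones_by_eps:
        if ones < threshold:
            return 1.0 / eps
    return None


# Example input: eps = 1/2, 1/4, 1/8, ... for a node with 100 neighbors.
counts = [(0.5 ** j, expected_ones(100, 64, 0.5 ** j)) for j in range(1, 21)]
estimate = adaptive_estimate(counts, 64)
```

Because ε is halved at each step, an estimate obtained this way can only be accurate up to roughly a factor of two.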
Convergence
[Figure: fraction of nodes with estimates (0 to 100) versus iteration (5 to 20), one curve per distance d = 1, ..., 8]
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
We measured:
$$\text{Precision} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of hosts classified as spam}}$$
$$\text{Recall} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of spam hosts}}$$
and the two types of errors in spam classification:
$$\text{False positive rate} = \frac{\text{\# of normal hosts classified as spam}}{\text{\# of normal hosts}}$$
$$\text{False negative rate} = \frac{\text{\# of spam hosts classified as normal}}{\text{\# of spam hosts}}$$
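The four measures above can be computed directly from the counts of a confusion matrix. The sketch below assumes hypothetical argument names (`tp` for spam hosts classified as spam, `fp` for normal hosts classified as spam, `fn` for spam hosts classified as normal, `tn` for normal hosts classified as normal) and assumes every denominator is nonzero.

```python
def spam_metrics(tp, fp, fn, tn):
    """Precision, recall and the two error rates for the spam class.

    tp: spam hosts classified as spam
    fp: normal hosts classified as spam
    fn: spam hosts classified as normal
    tn: normal hosts classified as normal
    Assumes tp + fp, tp + fn and fp + tn are all nonzero.
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }
```

Recall and the false negative rate share the same denominator (the number of spam hosts), so they always sum to one.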
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
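As a rough illustration of how such a feature set might be assembled for one host, the sketch below builds a dictionary from a PageRank value and a per-distance metric (Truncated PageRank or estimated supporters). The function name, the feature names, and the choice to compute one ratio per distance are assumptions of this sketch; the slides do not specify the exact layout.

```python
def single_technique_features(pagerank, metric_by_distance, prefix):
    """Build the feature dictionary for one host.

    pagerank: PageRank of the page (assumed nonzero).
    metric_by_distance: maps a distance (e.g. 2, 3, 4) to the metric value.
    prefix: label for the metric, e.g. "trunc_pr" or "supporters".
    """
    features = {"pagerank": pagerank}
    for distance, value in sorted(metric_by_distance.items()):
        features[f"{prefix}_{distance}"] = value
        # The metric divided by PageRank, as listed on the slide.
        features[f"{prefix}_{distance}_over_pr"] = value / pagerank
    return features
```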
Comparison of single-technique classifiers (M = 5)
Classifiers                Spam class          False    False
(pruning with M = 5)       Prec.    Recall     Pos.     Neg.
TrustRank                  0.82     0.50       2.1%     50%
Trunc. PageRank t = 2      0.85     0.50       1.6%     50%
Trunc. PageRank t = 3      0.84     0.47       1.6%     53%
Trunc. PageRank t = 4      0.79     0.45       2.2%     55%
Est. Supporters d = 2      0.78     0.60       3.2%     40%
Est. Supporters d = 3      0.83     0.64       2.4%     36%
Est. Supporters d = 4      0.86     0.64       2.0%     36%
Comparison of single-technique classifiers (M = 30)
Classifiers                Spam class          False    False
(pruning with M = 30)      Prec.    Recall     Pos.     Neg.
TrustRank                  0.80     0.49       2.3%     51%
Trunc. PageRank t = 2      0.82     0.43       1.8%     57%
Trunc. PageRank t = 3      0.81     0.42       1.8%     58%
Trunc. PageRank t = 4      0.77     0.43       2.4%     57%
Est. Supporters d = 2      0.76     0.52       3.1%     48%
Est. Supporters d = 3      0.83     0.57       2.1%     43%
Est. Supporters d = 4      0.80     0.57       2.6%     43%
Combined classifier
                       Spam class            False    False
Pruning    Rules       Precision   Recall    Pos.     Neg.
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Comparison of single-technique classifiers (M = 30)
Classifiers                 Spam class         False    False
(pruning with M = 30)       Prec.    Recall    pos.     neg.
TrustRank                   0.80     0.49      2.3%     51%
Trunc. PageRank t = 2       0.82     0.43      1.8%     57%
Trunc. PageRank t = 3       0.81     0.42      1.8%     58%
Trunc. PageRank t = 4       0.77     0.43      2.4%     57%
Est. Supporters d = 2       0.76     0.52      3.1%     48%
Est. Supporters d = 3       0.83     0.57      2.1%     43%
Est. Supporters d = 4       0.80     0.57      2.6%     43%
Combined classifier
Pruning    Rules    Spam class precision    Spam class recall    False pos.    False neg.
Conclusions
✓ Link-based statistics to detect 80% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). Spamrank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001). The classification of search engine spam. Available online at http://www.silverdisc.co.uk/articles/spam-classification
Test collection
UK collection:
18.5 million pages downloaded from the .UK domain
5,344 hosts manually classified (6% of the hosts)
Classified entire hosts:
✓ A few hosts are mixed: spam and non-spam pages
✗ More coverage: sample covers 32% of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004).
Combating web spam with trustrank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Motivation
Spam pages characterization
Truncated PageRank
Counting supporters
Experiments
Conclusions
Test collection
UK collection
18.5 million pages downloaded from the .UK domain
5,344 hosts manually classified (6% of the hosts)
Classified entire hosts:
✓ A few hosts are mixed: spam and non-spam pages
✗ More coverage: sample covers 32% of the pages
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at 2, ..., 4, Supporters at 2, ..., 4
We measured:
\[ \text{Precision} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of hosts classified as spam}} \]
\[ \text{Recall} = \frac{\text{\# of spam hosts classified as spam}}{\text{\# of spam hosts}} \]
and the two types of errors in spam classification:
\[ \text{False positive rate} = \frac{\text{\# of normal hosts classified as spam}}{\text{\# of normal hosts}} \]
\[ \text{False negative rate} = \frac{\text{\# of spam hosts classified as normal}}{\text{\# of spam hosts}} \]
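The four measures above can be computed directly from per-host labels and predictions. The following is a minimal sketch, not the authors' evaluation code: the names `Rates`, `_ratio` and `spam_rates` are hypothetical, and returning NaN for an empty denominator is an assumption made here for convenience.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    precision: float
    recall: float
    false_positive_rate: float
    false_negative_rate: float


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: report NaN when a measure is undefined (empty denominator).
    return numerator / denominator if denominator else float("nan")


def spam_rates(is_spam, predicted_spam) -> Rates:
    """Compute the four measures from two parallel lists of booleans, one entry per host."""
    pairs = list(zip(is_spam, predicted_spam))
    tp = sum(1 for actual, pred in pairs if actual and pred)          # spam classified as spam
    fp = sum(1 for actual, pred in pairs if not actual and pred)      # normal classified as spam
    fn = sum(1 for actual, pred in pairs if actual and not pred)      # spam classified as normal
    tn = sum(1 for actual, pred in pairs if not actual and not pred)  # normal classified as normal
    return Rates(
        precision=_ratio(tp, tp + fp),
        recall=_ratio(tp, tp + fn),
        false_positive_rate=_ratio(fp, fp + tn),
        false_negative_rate=_ratio(fn, tp + fn),
    )
```

Since recall and the false negative rate share the same denominator (the number of spam hosts), they always sum to 1 whenever at least one spam host is present.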
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank
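Each single-technique classifier uses the same feature pattern: the PageRank, a technique-specific score at one or more distances, and that score divided by PageRank. A minimal sketch of assembling such a feature set for one page follows. The function name `host_features`, the feature keys, and the choice to compute one ratio per distance are assumptions for illustration, not details taken from the slides.

```python
def host_features(pagerank: float, scores_by_distance: dict) -> dict:
    """Build the feature set for one page.

    pagerank: the page's PageRank (assumed non-zero).
    scores_by_distance: hypothetical mapping from distance (e.g. t or d in {2, 3, 4})
        to the technique-specific score (Truncated PageRank or estimated supporters).
    """
    features = {"pagerank": pagerank}
    for distance in sorted(scores_by_distance):
        score = scores_by_distance[distance]
        features[f"score_{distance}"] = score
        # Ratio of the technique-specific score to PageRank, as listed among the features.
        features[f"score_{distance}_over_pagerank"] = score / pagerank
    return features
```

Under the same assumption, the features would be extracted twice per host, once for the home page and once for the page with maximum PageRank, and the two dictionaries combined before training.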
Comparison of single-technique classifiers (M = 5)
Classifiers               Spam class            False   False
(pruning with M = 5)      Prec.     Recall      Pos.    Neg.
TrustRank                 0.82      0.50        2.1%    50%
Trunc. PageRank t = 2     0.85      0.50        1.6%    50%
Trunc. PageRank t = 3     0.84      0.47        1.6%    53%
Trunc. PageRank t = 4     0.79      0.45        2.2%    55%
Est. Supporters d = 2     0.78      0.60        3.2%    40%
Est. Supporters d = 3     0.83      0.64        2.4%    36%
Est. Supporters d = 4     0.86      0.64        2.0%    36%
Comparison of single-technique classifiers (M = 30)
Classifiers               Spam class            False   False
(pruning with M = 30)     Prec.     Recall      Pos.    Neg.
TrustRank                 0.80      0.49        2.3%    51%
Trunc. PageRank t = 2     0.82      0.43        1.8%    57%
Trunc. PageRank t = 3     0.81      0.42        1.8%    58%
Trunc. PageRank t = 4     0.77      0.43        2.4%    57%
Est. Supporters d = 2     0.76      0.52        3.1%    48%
Est. Supporters d = 3     0.83      0.57        2.1%    43%
Est. Supporters d = 4     0.80      0.57        2.6%    43%
Combined classifier
                     Spam class              False   False
Pruning    Rules     Precision    Recall     Pos.    Neg.
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Thank you
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005).
SpamRank: fully automatic link spam detection.
In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages.
In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z. and Garcia-Molina, H. (2005).
Web spam taxonomy.
In First International Workshop on Adversarial Information Retrieval on the Web.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with TrustRank.
In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and their applications.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massive graphs.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Perkins, A. (2001).
The classification of search engine spam.
Available online at http://www.silverdisc.co.uk/articles/spam-classification
Automatic classifier
We extracted (for the home page and the page with maximum PageRank): PageRank, Truncated PageRank at distances 2 to 4, Supporters at distances 2 to 4
We measured:
Precision = (# of spam hosts classified as spam) / (# of hosts classified as spam)
Recall = (# of spam hosts classified as spam) / (# of spam hosts)
and the two types of errors in spam classification:
False positive rate = (# of normal hosts classified as spam) / (# of normal hosts)
False negative rate = (# of spam hosts classified as normal) / (# of spam hosts)
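As an illustration of the four measures, here is a minimal Python sketch that computes them from the counts of a host-level confusion matrix. The class name `Counts`, the function `evaluate`, and the example numbers are hypothetical and not taken from the slides; they only show how the definitions fit together.

```python
import math
from dataclasses import dataclass


@dataclass
class Counts:
    """Host-level confusion counts (hypothetical container, not from the slides)."""
    spam_as_spam: int      # spam hosts classified as spam
    spam_as_normal: int    # spam hosts classified as normal
    normal_as_spam: int    # normal hosts classified as spam
    normal_as_normal: int  # normal hosts classified as normal


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is empty.
    return numerator / denominator if denominator else math.nan


def evaluate(c: Counts) -> dict:
    """Compute the four measures exactly as defined on the slide."""
    classified_spam = c.spam_as_spam + c.normal_as_spam
    spam_hosts = c.spam_as_spam + c.spam_as_normal
    normal_hosts = c.normal_as_spam + c.normal_as_normal
    return {
        "precision": _ratio(c.spam_as_spam, classified_spam),
        "recall": _ratio(c.spam_as_spam, spam_hosts),
        "false_positive_rate": _ratio(c.normal_as_spam, normal_hosts),
        "false_negative_rate": _ratio(c.spam_as_normal, spam_hosts),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = Counts(spam_as_spam=50, spam_as_normal=50,
                     normal_as_spam=10, normal_as_normal=490)
    for name, value in evaluate(example).items():
        print(f"{name}: {value:.3f}")
```

Because recall and the false negative rate share the same denominator (the number of spam hosts), they always sum to 1, so a recall of 0.50 corresponds to a false negative rate of 50%.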
Single-technique classifier
Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be just based on in-degree), and the Truncated PageRank divided by PageRank.
Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
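All three single-technique classifiers share the same feature pattern: the PageRank, one technique-specific score, and that score divided by PageRank. The sketch below shows one way such a feature vector could be assembled. The names `PageScores`, `page_features`, and `host_features` are hypothetical, and concatenating the features of the home page and of the maximum-PageRank page is an assumption carried over from the "Automatic classifier" slide, not something the slides state for these classifiers.

```python
from typing import List, NamedTuple


class PageScores(NamedTuple):
    """Scores for one page (hypothetical container, not from the slides)."""
    pagerank: float
    # Technique-specific score: estimated non-spam mass (TrustRank),
    # Truncated PageRank at distance t, or estimated supporters at distance d.
    technique_score: float


def page_features(p: PageScores) -> List[float]:
    """Return [PageRank, score, score / PageRank] for a single page."""
    if p.pagerank <= 0:
        raise ValueError("PageRank must be positive to form the ratio feature")
    return [p.pagerank, p.technique_score, p.technique_score / p.pagerank]


def host_features(home: PageScores, max_pagerank_page: PageScores) -> List[float]:
    """Concatenate features of the home page and the max-PageRank page.

    Assumption: both pages are used, as on the 'Automatic classifier' slide.
    """
    return page_features(home) + page_features(max_pagerank_page)


if __name__ == "__main__":
    # Made-up values for illustration only.
    home = PageScores(pagerank=0.002, technique_score=0.001)
    best = PageScores(pagerank=0.004, technique_score=0.001)
    print(host_features(home, best))
```

The ratio feature makes the score relative to the page's overall PageRank, so two pages with the same raw score but very different PageRank values are represented differently.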