1
Anti-spam Algorithm: TrustRank, Hilltop
954203041 林裕得, 954203057 蔡繼正
2
Outline
• Introduction
• Comparison of PageRank, TrustRank, and Hilltop
• TrustRank
– Combating Web Spam with TrustRank
• Hilltop
– Hilltop: A Search Engine based on Expert Documents
• Evaluation
3
Introduction (1/3)
• PageRank
• Current problem
– Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.
[Figure: example link graph with node scores 0.4, 0.4, and 0.2]
4
Introduction (2/3)
• Types of web spam
– Content spam
  • Hidden or invisible text
  • Keyword stuffing
  • Meta tag stuffing
– Link spam
  • Link farms
  • Hidden links
– Other types
  • Mirror websites
  • URL redirection
5
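Introduction (3/3)
• How to combat web spam
– TrustRank or Hilltop: identify the pages that are definitely not spam
– BadRank or SpamRank: identify the pages that are definitely spam
– Sandbox: cannot reliably tell spam pages from non-spam pages, but the practice effectively suppresses the SEO market
– Manual reports and specific anti-spam methods: help build a more comprehensive spam pool
  • http://www.google.com/contact/spamreport.html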
6
Comparison of PageRank, TrustRank, and Hilltop (1/3)
• All three are connectivity-based algorithms: they assume that the number and quality of the sources linking to a page are a good measure of the page's quality.
7
Comparison of PageRank, TrustRank, and Hilltop (2/3)
• Basic assumption
– PageRank
  • A good page has many important inlinks.
– TrustRank
  • Good pages point to good ones.
– Hilltop
  • Only expert pages point to good ones.
8
Comparison of PageRank, TrustRank, and Hilltop (3/3)
                 PageRank     TrustRank    Hilltop
Inlink source    All pages    All pages    Expert pages
Initial score    Average      1 or 0       Algorithm
[Figure: example graphs with initial scores. PageRank: 0.16 on every node; TrustRank: 0.33 or 0 per node; Hilltop: 0.5, 0.2, 0.3]
9
TrustRank (1/7)
[Figure: two matrices; only the rows below are recoverable]
First matrix:
0   0     0   …
1   0     1   …
0   0.5   0   …
0   0.5   0   …
…
Second matrix:
0   0.5   0     …
0   0     0.5   …
0   0.5   0     …
…
10
11
TrustRank (2/7)
• Step 1: Evaluate the seed-desirability of pages by inverse PageRank
12
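$s = \alpha \cdot U \cdot s + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$ (repeated M times)
where s is the seed-desirability vector, U is the inverse transition matrix, N is the number of pages, and α is the decay factor.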
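The inverse PageRank iteration can be sketched as follows. This is a minimal illustration, not the authors' implementation: the graph is a hypothetical adjacency list (`out_links[p]` lists the pages that p links to), and the names `inverse_pagerank` and `top_seed_candidates` as well as the defaults α = 0.85 and M = 20 are assumptions made for the example.

```python
def inverse_pagerank(out_links, alpha=0.85, iterations=20):
    """Seed desirability: PageRank computed on the graph with its edges reversed.

    out_links[p] is the list of pages that page p links to (pages are 0..N-1).
    """
    n = len(out_links)
    in_degree = [0] * n
    for targets in out_links:
        for q in targets:
            in_degree[q] += 1

    s = [1.0] * n  # start from the all-ones vector
    for _ in range(iterations):
        # U[p][q] = 1 / in_degree(q) when p links to q, so p collects
        # desirability from the pages it points to.
        s = [
            alpha * sum(s[q] / in_degree[q] for q in out_links[p])
            + (1 - alpha) / n
            for p in range(n)
        ]
    return s


def top_seed_candidates(scores, limit):
    """Indices of the `limit` most desirable pages, best first."""
    return sorted(range(len(scores)), key=lambda p: -scores[p])[:limit]


# Hypothetical graph: page 0 links to pages 1, 2, and 3.
scores = inverse_pagerank([[1, 2, 3], [], [], []])
print(top_seed_candidates(scores, 1))
```

A page scores high here when many pages are reachable from it, which is why it makes a useful seed: trust given to it spreads widely.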
13
TrustRank (3/7)
• Step 2: Generate good seeds
14
TrustRank (4/7)
• Step 3: Select good seeds
e.g., L = 3, seed set is {2, 4, 5}
15
TrustRank (5/7)
• Step 4: Normalize the static score distribution vector
16
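TrustRank (6/7)
• Step 5: Compute the TrustRank score
$t^* = \alpha \cdot T \cdot t^* + (1 - \alpha) \cdot d$ (repeated M times)
where t* is the vector of TrustRank scores, T is the transition matrix, d is the normalized static score distribution vector, and α is the decay factor.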
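Steps 4 and 5 can be sketched together. This is an illustrative sketch under stated assumptions: the graph is a hypothetical adjacency list, the good seed set is passed in directly (in practice it comes from checking the top L candidates), and the function name `trustrank` and the defaults α = 0.85 and M = 20 are chosen for the example.

```python
def trustrank(out_links, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust from the good seeds along outgoing links.

    out_links[q] is the list of pages that page q links to (pages are 0..N-1).
    good_seeds is a non-empty set of page indices judged to be good.
    """
    n = len(out_links)
    # Step 4: static score distribution d, normalized so its entries sum to 1.
    d = [1.0 / len(good_seeds) if p in good_seeds else 0.0 for p in range(n)]

    # Step 5: t* = alpha * T * t* + (1 - alpha) * d, repeated `iterations` times.
    t = d[:]
    for _ in range(iterations):
        nxt = [(1 - alpha) * d[p] for p in range(n)]
        for q, targets in enumerate(out_links):
            for p in targets:
                # T[p][q] = 1 / out_degree(q): q splits its trust evenly.
                nxt[p] += alpha * t[q] / len(targets)
        t = nxt
    return t


# Hypothetical graph: 0 -> 1 -> 2 is reachable from the seed; 3 -> 4 is not.
print(trustrank([[1], [2], [], [4], []], good_seeds={0}))
```

Pages that cannot be reached from any good seed (3 and 4 above) keep a score of 0, and trust decays with distance from the seed.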
17
TrustRank (7/7)
• Conclusion
18
Hilltop (1/9)
• Expert page
– A page that is about a certain topic and has links to many non-affiliated pages on that topic.
• Non-affiliated
– Two pages are conceptually non-affiliated if they are authored by authors from non-affiliated organizations.
19
Hilltop (2/9)
• Step 1: Expert Lookup
– Detecting Host Affiliation
– Selecting the Experts
– Indexing the Experts
[Figure: an expert page points to non-affiliated pages; its key phrases are indexed]
20
Hilltop (3/9)
• Detecting Host Affiliation
– Rules: two hosts are affiliated if one or both of the following are true
  • They share the same first 3 octets of the IP address.
  • The rightmost non-generic token in the hostname is the same.
    e.g., "www.ibm.com" and "ibm.co.mx" are affiliated (shared token "ibm")
– The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.
21
Hilltop (4/9)
• Selecting the Experts
– Consider every page with out-degree greater than a threshold k (e.g., k = 5) and test whether its URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.
[Figure: an expert page pointing to non-affiliated pages]
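Host affiliation and expert selection can be sketched together. This is a minimal sketch under stated assumptions: host IP addresses are supplied in a dictionary rather than resolved, `GENERIC_TOKENS` is a small hypothetical list of generic hostname tokens, and the function names are invented for the example. Transitivity is handled with a union-find structure.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of generic hostname tokens (an assumption).
GENERIC_TOKENS = {"www", "com", "co", "org", "net", "edu", "gov", "mx"}


def rightmost_non_generic_token(host):
    for token in reversed(host.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return None


def affiliation_groups(host_ips):
    """Map each host to a representative of its affiliation group.

    host_ips: dict of hostname -> IPv4 address string.
    """
    parent = {host: host for host in host_ips}

    def find(host):
        while parent[host] != host:
            parent[host] = parent[parent[host]]  # path halving
            host = parent[host]
        return host

    first_seen = {}  # rule key -> first host that had it
    for host, ip in host_ips.items():
        keys = [("ip", ".".join(ip.split(".")[:3]))]  # first 3 octets
        token = rightmost_non_generic_token(host)
        if token is not None:
            keys.append(("token", token))
        for key in keys:
            if key in first_seen:
                # Merging groups makes the relation transitive.
                parent[find(host)] = find(first_seen[key])
            else:
                first_seen[key] = host
    return {host: find(host) for host in host_ips}


def is_expert(out_urls, host_group, k=5):
    """True if the page has out-degree > k and links to >= k distinct groups."""
    if len(out_urls) <= k:
        return False
    groups = set()
    for url in out_urls:
        host = urlsplit(url).hostname
        if host is not None:
            groups.add(host_group.get(host, host))
    return len(groups) >= k
```

For example, with "www.ibm.com" and "ibm.co.mx" sharing the token "ibm", and "ibm.co.mx" sharing its first three IP octets with a third host, all three hosts end up in one group and count only once toward k.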
22
Hilltop (5/9)
• Indexing the Experts
– Index the text contained within the "key phrases" of the expert. The following are considered key phrases:
  • title
  • headings (e.g., <H1> </H1> tags)
  • anchor text
– A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text.
23
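Hilltop (6/9)
• Example
– The title qualifies all 4 URLs
– Each heading qualifies 2 URLs
– Each anchor text qualifies 1 URL
<title> National Central University </title>
<h1> Department of Information Management </h1> <A> 001 </A> <A> 002 </A>
<h1> Department of Business Administration </h1> <A> 001 </A> <A> 002 </A>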
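The scoping in this example can be sketched in code. This is a rough sketch, assuming the page has already been reduced to a stream of (kind, text, url) events and that a heading qualifies the links that follow it until the next heading; the event format, the URL labels, and the name `qualify_urls` are hypothetical.

```python
def qualify_urls(events):
    """Return (kind, text, urls) for every key phrase, in document order.

    events: list of (kind, text, url) with kind in {"title", "heading",
    "anchor"}; url is None except for anchors.
    """
    phrases = []
    title = None
    heading = None
    for kind, text, url in events:
        if kind == "title":
            title = (kind, text, [])
            phrases.append(title)
        elif kind == "heading":
            heading = (kind, text, [])  # a new heading ends the previous scope
            phrases.append(heading)
        elif kind == "anchor":
            phrases.append((kind, text, [url]))  # anchor text qualifies its own URL
            for scope in (title, heading):
                if scope is not None:
                    scope[2].append(url)
    return phrases


# The slide's page layout, with hypothetical URL labels u1..u4.
EXAMPLE_EVENTS = [
    ("title", "National Central University", None),
    ("heading", "Department of Information Management", None),
    ("anchor", "001", "u1"),
    ("anchor", "002", "u2"),
    ("heading", "Department of Business Administration", None),
    ("anchor", "001", "u3"),
    ("anchor", "002", "u4"),
]

for kind, text, urls in qualify_urls(EXAMPLE_EVENTS):
    print(kind, text, len(urls))
```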
24
Hilltop (7/9)
• Step 2: Target Ranking
– Computing the Expert Score
– Computing the Target Score
[Figure: the top N = 200 expert pages point to target pages; at least 2 experts must point to a target]
25
Hilltop (8/9)
• Computing the Expert Score
– The expert score reflects the number and importance of the key phrases that contain the query keywords.
26
Computing the Expert Score (1/2)
S0: total score of the key phrases that contain all k query keywords
S1: total score of the key phrases that contain k-1 query keywords
S2: total score of the key phrases that contain k-2 query keywords
Si = SUM(key phrases p with k-i query terms) LevelScore(p) * FullnessFactor(p,q)
LevelScore: 16 for a title, 6 for a heading, 1 for anchor text
FullnessFactor: plen is the number of terms in p, and m is the number of terms in p that are not in q
If m <= 2, FullnessFactor(p,q) = 1
If m > 2, FullnessFactor(p,q) = 1 - (m-2) / plen
Example
Query: A B
Title: A B C
H1: A
S0 = 16*1 = 16 (the title contains both query terms)
S1 = 6*1 = 6 (the heading contains one query term)
S2 = 0
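The level sums can be sketched in code. This is an illustration, not the authors' implementation: key phrases are assumed to arrive as (level, terms) pairs, phrases that contain no query term are skipped (an assumption), and the names `fullness_factor` and `level_sums` and the sample phrases are hypothetical.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """1 when at most 2 phrase terms fall outside the query, else reduced."""
    m = sum(1 for term in phrase_terms if term not in query_terms)
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / len(phrase_terms)


def level_sums(key_phrases, query_terms):
    """Return [S0, S1, S2] for one expert page.

    key_phrases: list of (level, terms), level in LEVEL_SCORE, terms a list.
    """
    query = set(query_terms)
    k = len(query)
    sums = [0.0, 0.0, 0.0]
    for level, terms in key_phrases:
        matched = len(query & set(terms))
        i = k - matched  # the phrase contains k - i query terms
        if matched == 0 or i > 2:
            continue  # assumption: phrases without any query term are ignored
        sums[i] += LEVEL_SCORE[level] * fullness_factor(terms, query)
    return sums


# Hypothetical expert page for the query "web spam".
phrases = [
    ("title", ["combating", "web", "spam"]),
    ("heading", ["spam"]),
    ("anchor", ["web", "spam", "pages", "on", "the", "internet"]),
]
print(level_sums(phrases, ["web", "spam"]))
```

The long anchor text matches both query terms but carries four extra terms, so its contribution to S0 is scaled down by its FullnessFactor of 1 - 2/6.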
27
Computing the Expert Score (2/2)
S0: total score of the key phrases that contain all k query keywords
S1: total score of the key phrases that contain k-1 query keywords
S2: total score of the key phrases that contain k-2 query keywords
Expert_Score = 2^32 * S0 + 2^16 * S1 + S2
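A small sketch of how the three level sums combine; the function name `expert_score` and the sample values are hypothetical. The large weights mean that, for sums of moderate size, S0 dominates S1 and S1 dominates S2, so experts matching more query keywords rank first.

```python
def expert_score(s0, s1, s2):
    """Expert_Score = 2^32 * S0 + 2^16 * S1 + S2."""
    return 2**32 * s0 + 2**16 * s1 + s2


# A single full match outweighs many partial matches.
print(expert_score(1, 0, 0) > expert_score(0, 1000, 1000))
```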
28
Hilltop (9/9)
• Computing the Target Score
– The target score reflects both the number and relevance of the experts pointing to the target and the relevance of the phrases qualifying the links.
29
Computing the Target Score (1/2)
occ(w,T) is the number of distinct key phrases in E that contain w and qualify the edge (E,T).
If occ(w,T) is 0 for any query keyword w, then Edge_Score(E,T) = 0.
Otherwise,
Edge_Score(E,T) = Expert_Score(E) * SUM(query keywords w) occ(w,T)
[Figure: edge from expert E to target T]
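The edge score rule can be sketched as follows, assuming the key phrases of expert E that qualify the edge (E,T) are already available as lists of terms; the function name `edge_score` and the sample data are hypothetical.

```python
def edge_score(expert_score, phrases_on_edge, query_terms):
    """Edge_Score(E, T) for one expert E and one target T.

    phrases_on_edge: the distinct key phrases of E that qualify the edge,
    each given as a list of terms.
    """
    occ = {
        w: sum(1 for terms in phrases_on_edge if w in terms)
        for w in query_terms
    }
    if any(count == 0 for count in occ.values()):
        return 0  # some query keyword never qualifies this edge
    return expert_score * sum(occ.values())


# Hypothetical edge qualified by a title and an anchor text.
phrases = [["web", "spam", "research"], ["spam", "papers"]]
print(edge_score(10, phrases, ["web", "spam"]))
print(edge_score(10, phrases, ["web", "hilltop"]))
```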
30
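Computing the Target Score (2/2)
Target_Score(T) = SUM(non-affiliated experts E) Edge_Score(E,T)
[Figure: experts E1, E2, and E3 all point to target T]
E2 and E3 are affiliated and ES(E2,T) > ES(E3,T), so the edge from E3 is dropped and Target_Score(T) = ES(E1,T) + ES(E2,T).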
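A minimal sketch of the target score with affiliated experts collapsed to their best edge; the function name `target_score`, the group labels, and the scores are hypothetical.

```python
def target_score(edge_scores, group_of):
    """Sum of edge scores, keeping only the best edge per affiliation group.

    edge_scores: dict of expert -> Edge_Score(expert, T).
    group_of: dict of expert -> affiliation group; experts missing from it
    are treated as their own group.
    """
    best = {}
    for expert, score in edge_scores.items():
        group = group_of.get(expert, expert)
        best[group] = max(best.get(group, 0), score)
    return sum(best.values())


# E2 and E3 are affiliated, so only the stronger of their two edges counts.
print(target_score({"E1": 10, "E2": 7, "E3": 4}, {"E2": "g", "E3": "g"}))
```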
31
Evaluation-Trust Rank(1/3)
32
Evaluation-Trust Rank(2/3)
• Pairwise Orderness
33
Evaluation-Trust Rank(3/3)
• Precision
• Recall
34
Evaluation-Hilltop (1/2)
• Precision
35
Evaluation-Hilltop (2/2)
• Recall
36
References
• Combating Web Spam with TrustRank
– http://www.vldb.org/conf/2004/RS15P3.PDF
• Hilltop: A Search Engine based on Expert Documents
– http://www.cs.toronto.edu/~georgem/hilltop/
• Types of web spam
– http://en.wikipedia.org/wiki/Spamdexing
37
Q&A