Combating Link Spam M.Tech. Seminar Report

Submitted in partial fulfillment of the requirements for the degree of

Master of Technology

by Jubin Chheda

Roll No 06305003

under the guidance of: Prof. Soumen Chakrabarti

examined by: Om P. Damani

Department of Computer Science and Engineering,

Indian Institute of Technology, Bombay.


Abstract: As more and more people rely on search engines as starting points to fulfill their information needs, it has become crucial to have one's pages appear among the top few results of popular search engines. Most search engines use, among other things, variants of the classic PageRank algorithm, which relies on the link structure of the web to rank pages. In order to have their pages rank higher than they deserve, some web designers resort to all sorts of tricks to mislead search engines, manipulating linkage (link-spam) and content (term-spam) on their pages and on the web, and in the process give form to what has come to be called web-spam. There is a continuing clash between search-engine algorithm designers and web-spammers, leading to the battleground of the Adversarial Web. Our main focus in this report is link-spam. We take a look at the different methods of combating link-spam. We also look at optimal link-spam structures and test them using Java code. We implement popular ranking algorithms and test their efficacy on a web-graph made available by Webaroo.


Table of Contents:

I. Introduction
II. Web Model
III. Ranking Algorithms
    A. PageRank
    B. TrustRank
IV. Web Spam Taxonomy
    A. Link Spamming Techniques
    B. Term Spamming
    C. Hiding Techniques
V. Optimal Spamming Structures
VI. Tweaking Ranking Algorithms: a solution to link-spam?
    A. Antitrust and Distrust Ranking
    B. Combining Trust and Distrust
    C. Truncated PageRank
    D. Topical TrustRank
VII. Statistics about pages: features to classify link-spam
VIII. Scope for Future work
IX. Conclusion
X. References


I. Introduction: Everyone has an information need; how does it get satisfied? The web seems to have some answers: the 21st century's answer to the Library of Alexandria. However, it is too messy, too disorganized and too fast-changing; the web is Godzilla, Socrates and Jesse Owens all packed into one: huge, smart and fast. We need a catalog, and we need trust. Hence the need for the search engine. "Users started at a search engine 88% of the time when we gave them a new task to complete on the Web." [6]. The key to the success of search engines is their simplicity and comprehensiveness. The difficulty of the search problem lies in presenting the top 10 relevant sites. The need is fulfilled if the results of the search point to answers; in other words, the relevance of the results is the key. Moreover, 85% of the time, people don't look beyond the top 10 results [7]. People make medical, financial, cultural and security-related decisions based on search engine results. Traditionally, search engines have employed ranking algorithms which use the linkage between websites to represent endorsement, and have pushed up websites that are referred to by other high-ranking websites. PageRank and HITS are two such algorithms. Ranking high on a search engine is thus something that fetches a high premium. E-commerce, propagandistic and marketing websites have a business stake in featuring on top. In this scenario, some web designers want to do all they can to have their pages rank high, artificially. Enter Web Spam. "Web Spam refers to hyperlinked pages on the WWW which are created with the intention of misleading search engines." [2] Literature with statistics about the amount of web-spam is limited. [8] report:

Table 1: Amount of web-spam

Data set                         Crawl date     Data set size       Sample size   Spam
Fetterly et al.                  11/02-02/03    150 million pages   751 pages     8.1%
Fetterly et al. (Yahoo BFS)      07/02-09/02    429 million pages   535 pages     6.9%
Gyöngyi et al. (AltaVista set)   08/03          31 million pages    748 pages     18%

The methods used to spam the web are broadly classified into two categories: link spamming and term (content/text) spamming. Link spamming refers to manipulating the in-links and/or out-links of pages, and in effect a link substructure of the web, to boost rankings for one's pages and mislead search engines. Link spamming exploits weaknesses in traditional ranking algorithms. To boost the rankings of a page, spammers induce high-ranking pages to point to it and orchestrate link structures within their own pages to boost the rankings of a few target pages. Some spammers even resort to arranging whole collections of sub-domains pointing to each other: setting up spam farms. Term spamming, on the other hand, refers to spamming the text fields on a page with spam terms to make the page appear more relevant. Techniques include


dumping, which is the inclusion of a large number of unrelated terms on a page, even whole dictionaries, just so that the page will show up as relevant for some obscure terms. To cover up tell-tale manipulations on spammed pages, so that humans cannot figure them out, spammers use hiding techniques. Popular ones include cloaking: serving one version of a page to crawlers and another to human users. All this has created the war-zone of the Adversarial Web, where both sides, search engines and spammers, try to outwit each other. In this report we concentrate on ways to detect and prevent link-spam. This problem involves ideas not only from the areas of Information Retrieval and Machine Learning, but from domains as diverse as anthropology, linguistics, political science, and economics, among others [9]. The rest of this section outlines the report. The report starts off with a model of the web as a graph. In Section III, we take a look at PageRank and then TrustRank. In Section IV, we come up with a taxonomy for web-spam, which we hope will help bring order to the means of tackling each type. In Section V, we look at optimal structures for spam-farms and stress how structures can be created to maximize the rank of desired pages. In Section VI, we look at some algorithms which tweak PageRank to come up with alternatives. In Section VII, we look at statistical features as a potential holy grail for detecting spam. We implement the ranking algorithms (PageRank, TrustRank, DistrustRank, etc.) using a stream model, and try to verify the claims in various papers by testing them with these implementations.


II. Web Model: We model the web as a graph G = (V, E), with V the set of pages (vertices) and E the set of directed links (edges). We remove multiple links and self-links. Consider the simple web-graph in Figure 1: it has 5 pages and 6 links. The number of inlinks of a page p is its indegree ι(p), and the number of outlinks is its outdegree ω(p). Pages with no inlinks are called unreferenced pages, and those with no outlinks are non-referencing pages.

The transition matrix T is defined as:

T(p, q) = 1/ω(q) if (q, p) ∈ E, and 0 otherwise.

The matrix T for the graph in Figure 1 is:

T =
[ 0    0    0    0    0 ]
[ 1    0   1/2  1/2   0 ]
[ 0    1    0    0    0 ]
[ 0    0   1/2   0    0 ]
[ 0    0    0   1/2   0 ]

The inverse transition matrix U is defined as:

U(p, q) = 1/ι(q) if (p, q) ∈ E, and 0 otherwise.

The matrix U for the graph in Figure 1 is:

U =
[ 0   1/3   0    0    0 ]
[ 0    0    1    0    0 ]
[ 0   1/3   0    1    0 ]
[ 0   1/3   0    0    1 ]
[ 0    0    0    0    0 ]

Figure 1: Simple Web-graph (pages numbered 1 to 5)
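To make the model concrete, here is a small Java sketch (our own illustration, not the report's actual implementation; class and method names are assumptions) that builds T from an edge list. The example edges are the six links implied by the matrix T above.

import java.util.*;

// Minimal sketch: build the transition matrix T from an edge list,
// where T[p][q] = 1/omega(q) if there is a link q -> p, and 0 otherwise.
public class WebGraph {
    public static double[][] transitionMatrix(int n, int[][] edges) {
        int[] outDegree = new int[n];
        for (int[] e : edges) outDegree[e[0]]++;          // omega(q) for every page q
        double[][] t = new double[n][n];
        for (int[] e : edges) {
            int q = e[0], p = e[1];                       // link q -> p
            t[p][q] = 1.0 / outDegree[q];                 // endorsement split over q's outlinks
        }
        return t;
    }

    public static void main(String[] args) {
        // Figure 1: pages 1..5 (0-indexed here), 6 links
        int[][] edges = {{0,1},{1,2},{2,1},{2,3},{3,1},{3,4}};
        double[][] t = transitionMatrix(5, edges);
        System.out.println(Arrays.deepToString(t));
    }
}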


III. Ranking Algorithms: PageRank [1] and HITS [10] were the first attempts to provide an importance score to pages, and they make extensive use of the concept of prestige from social network analysis. Based on PageRank, Haveliwala [11] came up with topic-sensitive PageRank, which paved the way for Gyöngyi et al. [2] to come up with TrustRank. We take a quick look at PageRank and TrustRank.

A. PageRank: The idea behind PageRank is that a page is important if several pages point to it; a page is thus influenced by, and influences, other pages. If each page is to have a rank, intuition tells us that a page must rank in proportion to the ranks of the pages pointing to it, with an inlink signifying a vote. Another thing built into such a scheme is that if a page points to several pages, its endorsement should be distributed equally amongst all of its outlinks. Thus, for a page p, the rank r(p) would be:

r(p) = Σ_{q: (q,p) ∈ E} r(q) / ω(q)

This arrangement works fine except in cases such as two nodes that point to each other and nowhere else, with one of them having an inlink: they end up as a rank-sink [1]. Thus, the equation was modified to:

r(p) = α · Σ_{q: (q,p) ∈ E} r(q) / ω(q) + (1 − α) · 1/N

Here, α serves as a damping factor, and the second term serves as a random jump to p from anywhere on the web. [ 1] also elucidates this idea using the random surfer model. The matrix form is:

r = α · T · r + (1 − α) · (1/N) · 1_N

Biased PageRank: Instead of using an equiprobable distribution for the random jump to any page, one can define one's own distribution by replacing (1/N) · 1_N with a vector d. Biased PageRank can be used to assign special non-zero scores (which add up to 1) to some pages [11]:

r = α · T · r + (1 − α) · d

where d can be initialized to some scores which disseminate to other pages over the iterations. Thus, if d contains primarily sports pages, then the biased PageRank will give a sports-based ranking [11].
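The iteration itself is simple. Below is a minimal Java sketch of the (biased) PageRank power iteration r = α·T·r + (1−α)·d; it is illustrative only — the report's own implementation uses a stream model over an edge list — and the class and method names are assumptions.

// Minimal sketch of the power iteration r = alpha*T*r + (1-alpha)*d.
// For unbiased PageRank, pass d with every entry equal to 1/N.
public class PageRank {
    public static double[] rank(double[][] t, double[] d, double alpha, int iterations) {
        int n = d.length;
        double[] r = d.clone();                              // start from the jump distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int p = 0; p < n; p++) {
                double sum = 0.0;
                for (int q = 0; q < n; q++) sum += t[p][q] * r[q];   // (T r)_p
                next[p] = alpha * sum + (1 - alpha) * d[p];          // plus the random jump
            }
            r = next;
        }
        return r;
    }
}

With the matrix T of Figure 1, α = 0.85 and 20 iterations, a uniform d should give scores close to the vector r1 reported below, and d = (0, 0.7, 0.3, 0, 0) scores close to r2.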


We use some Scilab code to calculate the PageRank scores for the graph in Figure 1, taking α = 0.85; r1 is the unbiased ranking, whereas r2 is biased with d:

d = (0, 0.7, 0.3, 0, 0);  r1 = (0.03, 0.18, 0.18, 0.11, 0.08);  r2 = (0, 0.27, 0.28, 0.12, 0.05)

We also implement PageRank using Java. We plan to verify topic-sensitive PageRank using a corpus of sports pages.

Consider another web-graph, shown in Figure 2. For 20 iterations and α = 0.85, we obtain the scores in Table 2.

Page   PageRank
7      0.263290
6      0.240463
2      0.126900
4      0.103614
3      0.077683
8      0.068595
1      0.053107
5      0.049682
0      0.016667

Table 2: PageRank Scores

Figure 2: A web-graph for PageRank calculation. Gray nodes are spam pages.


B. TrustRank: [2] builds on biased PageRank. The initial d is made to consist of normalized non-zero scores for known good (non-spam) pages. The idea is that goodness (trust) propagates in the forward direction: from known good nodes to the nodes that they point to. The main equation remains r = α · T · r + (1 − α) · d. However, the heart of the algorithm is how to select d. Seed set s: They come up with a seed set s, the set of pages initially considered for goodness. This seed set can be obtained by a SelectSeed function, for instance by applying inverse PageRank on the web-graph, i.e. PageRank on the transpose of the web-graph. Oracle function O(p): Human evaluation is used to decide whether a page p is spam.

This is formalized as:

O(p) = 0 if p is bad; 1 if p is good

TrustRank algorithm:

Input:
• T: transition matrix
• N: number of pages
• L: limit on oracle invocations
• α: decay factor
• M: number of PageRank iterations

Output:
• rt: TrustRank scores

Algorithm:
1. s = SelectSeed(...)                 // evaluate seed-suitability of each page (could be inverse PageRank)
2. σ = Rank({1, ..., N}, s)            // generate the corresponding ordering
3. d = 0_N                             // select good seeds
4. for i = 1 to L:
      a. if O(σ(i)) equals 1 then d(σ(i)) = 1
5. d = d / |d|                         // normalize d
6. rt = d                              // compute TrustRank scores
7. for i = 1 to M:
      a. rt = α · T · rt + (1 − α) · d
8. return rt
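A minimal Java sketch of the listing above, assuming the PageRank.rank routine sketched earlier and a caller-supplied oracle; it is illustrative only, not the exact code used for the experiments in this report.

import java.util.function.IntPredicate;

// Minimal sketch of TrustRank: pick good seeds via the oracle, then run biased PageRank.
public class TrustRank {
    public static double[] rank(double[][] t, double[] seedScore, IntPredicate oracle,
                                int L, double alpha, int M) {
        int n = seedScore.length;
        // sigma: pages ordered by decreasing seed-suitability (e.g. inverse PageRank)
        Integer[] sigma = new Integer[n];
        for (int i = 0; i < n; i++) sigma[i] = i;
        java.util.Arrays.sort(sigma, (a, b) -> Double.compare(seedScore[b], seedScore[a]));

        double[] d = new double[n];
        int good = 0;
        for (int i = 0; i < L; i++)                       // consult the oracle for the top-L pages
            if (oracle.test(sigma[i])) { d[sigma[i]] = 1.0; good++; }
        if (good > 0) for (int i = 0; i < n; i++) d[i] /= good;   // normalize d

        return PageRank.rank(t, d, alpha, M);             // rt = alpha*T*rt + (1-alpha)*d
    }
}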


Consider the graph in Figure 3. Note that 5, 6 and 7 are spam nodes. We use Java code and scan the graph as an edge list.

Figure 3: Web-graph for TrustRank calculation. Gray nodes are spam pages.

Using inverse PageRank for SelectSeed, we obtain:

Page   Inverse PageRank
2      0.09
4      0.08
3      0.08
0      0.08
1      0.06
7      0.05
8      0.04
6      0.04
5      0.04

s = {0.08, 0.06, 0.09, 0.08, 0.08, 0.04, 0.04, 0.05, 0.04}, which gives σ = {2, 4, 3, 0, 1, 7, 8, 6, 5}. Taking L = 2, we consider pages {2, 4}; both are good, so the oracle gives d = {0, 0, 1/2, 0, 1/2, 0, 0, 0, 0}. Taking α = 0.85 and M = 20, we obtain the TrustRank scores rt shown in Table 3. We can compare these with the PageRank scores in Table 2. We observe that 2 and 4 retain trust; however, 7 and 6 go undetected. Also, page 0 is wrongly given a low TrustRank. [2] report significant effectiveness of TrustRank in detecting spam, and pay a lot of attention to the systematic way of selecting seed pages and the rationale behind it.

Edge List: (0,1) (0,3) (1,2) (1,8) (2,3) (2,4) (3,4) (3,5) (4,8) (8,2) (4,1) (5,7) (6,7) (7,6) (4,2)

Page   TrustRank
2      0.24
4      0.22
7      0.13
6      0.11
3      0.10
8      0.09
1      0.06
5      0.04
0      0.00

Table 3: TrustRank Scores



IV. Web Spam Taxonomy: As a first step in gearing up for counter-measures, it is prudent to understand the spammers' 'arsenal'. This section elucidates attempts to organize web-spamming techniques into a taxonomy, and briefly reviews published statistics about web-spam. There have been discussions in the literature and on the web, but we draw heavily from [4]. We use two terms: importance, the ranking of a page in general, and relevance, the ranking of a page with respect to a specific query.

Figure 4: Web-Spam taxonomy¹ (boosting techniques split into link spamming and term spamming; link spamming divided into inlink techniques — honey pot, directory, wiki, link exchange, expired domains, farm — and outlink techniques — directory clone; term spamming includes dumping and weaving)

¹ We modify the taxonomy proposed by [4] for our purpose.


A. Link Spamming Techniques: To delve into link spamming let’s categorize pages according to the way they can be manipulated by spammers to influence results:

a. Inaccessible pages: Spammers cannot modify these pages. However, they can point to them.

b. Accessible pages: These pages do not belong to the spammer, but the spammer can modify their content in a limited manner. Typical examples are wikis and comments on blogs.

c. Own pages: The spammer wants to boost the ranking of one or more of these pages: the target pages, t. The number of own pages is capped by the spammer's budget (web hosting, etc.).

The target algorithms are HITS, PageRank, TrustRank, etc. HITS: HITS ranks hub and authority pages [10]. For HITS, the spammer can easily obtain high hub scores by adding outlinks to popular websites. Some spammers even pay users of high-ranked .edu authority pages to point to their spammy pages. The spammer can obtain high authority scores by having his unscrupulous hub pages point into a page, which can then become an authority page.

Figure 5: Spamming for HITS
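For reference, a minimal Java sketch of the HITS hub/authority updates that the above techniques exploit (our own illustration; the boolean adjacency-matrix input is an assumption).

// Minimal sketch of HITS: hub score = sum of authority scores of pages linked to,
// authority score = sum of hub scores of pages linking in.
public class Hits {
    public static double[][] rank(boolean[][] adj, int iterations) {
        int n = adj.length;
        double[] hub = new double[n], auth = new double[n];
        java.util.Arrays.fill(hub, 1.0);
        java.util.Arrays.fill(auth, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] newAuth = new double[n], newHub = new double[n];
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++) {
                    if (adj[q][p]) newAuth[p] += hub[q];   // q -> p raises p's authority
                    if (adj[p][q]) newHub[p] += auth[q];   // p -> q raises p's hub score
                }
            normalize(newAuth); normalize(newHub);
            auth = newAuth; hub = newHub;
        }
        return new double[][]{hub, auth};
    }

    private static void normalize(double[] v) {
        double s = 0; for (double x : v) s += x;
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}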

PageRank: For any set of pages A, the PageRank is given as [12]: PR(A) = PRstatic(A) + PRin(A) − PRout(A) − PRsink(A). This is explained in Section V. Techniques: Outgoing links: The spammer can manually add well-known links, but the smarter option is Directory Cloning: copying entire directory sites like the DMOZ Open Directory into one's pages. Incoming links:

• Creating a Honey-pot:


The idea is to provide some useful resources (e.g. articles or documents) and have good sites link to you. These goodies could themselves be stolen (e.g. hosting a copy of Wikipedia with all outlinks changed to point to one's own pages).
• Infiltrating a web directory: Directories are usually highly ranked, and a spammer can trick a webmaster into allowing links to the spammer's pages.
• Wikis, blogs, guest-books, unmoderated message-boards: A quick fix has been tools and bloggers maintaining white-lists of commenters, but all of these make it harder to obtain feedback and affect the way people blog.
• Link exchange: Spammers sometimes resort to mutual promotion.
• Expired domains: Spammers take advantage of the high rank conveyed by old links pointing to expired domains.
• Creating one's own spam farm: Spammers battle ever-new prevention techniques by building link structures to which popular algorithms are vulnerable. They own large numbers of domains these days.

B. Term Spamming: There are several fields on a web page which can be relevant to a query; these include the body, title, meta tags, anchor text and URL. Rigging up these text fields to make pages appear relevant is term-spamming. Target algorithm: the TFIDF metric [13] has long been used in information retrieval. The TFIDF score of a page p for a query q is computed as a sum over every term t common to p and q:

TFIDF(p, q) = Σ_t TF(t) · IDF(t)

where TF(t) is the (normalized) number of times term t appears in the document, and IDF(t), the inverse document frequency, is a measure of the general importance of the term. Thus, spammers can try to make a page relevant to many queries by including a large number of distinct terms, or relevant to some specific query by repeating particular terms.
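A small Java sketch of the TFIDF score above (our own illustration; the normalization and the log-based, smoothed IDF are assumptions, since the report does not fix them).

import java.util.*;

// Minimal sketch: TFIDF(p, q) = sum over terms t in both p and q of TF(t) * IDF(t).
public class Tfidf {
    public static double score(List<String> page, Set<String> query, List<List<String>> corpus) {
        double s = 0.0;
        for (String t : query) {
            long count = page.stream().filter(t::equals).count();
            if (count == 0) continue;
            double tf = (double) count / page.size();              // normalized term frequency
            long docsWithTerm = corpus.stream().filter(d -> d.contains(t)).count();
            double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));  // assumed smoothing
            s += tf * idf;
        }
        return s;
    }
}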

Some prominent techniques: Anchor-tag spam: spamming the anchor text of links that point to the target spam page; this affects the ranking of both the source and the target.

<a href="target.html">free, cheap, mortgage, free</a>

Dumping: Spammers build pages containing a large number of terms, even whole dictionaries, making them relevant to at least some term. Weaving:


Interleaving spam terms into relevant content, e.g. hosting a Wikipedia clone and randomly inserting the repeated term throughout the page.

C. Hiding Techniques: It is important for spammers to conceal their intent from a human visitor. Two techniques used here are: Content hiding: making the spam invisible on the page, for example by giving the spam text the same color as the background or by anchoring spam links on tiny 1x1-pixel images. Cloaking: serving one version of a page to humans and a different one to crawlers; this is done by keeping track of the IP addresses of crawlers and serving them different content.


V. Optimal Spamming Structures: There is sizable literature on how link structures should be organized to spam; it is open to question whom this serves, the researchers or the spammers. We present here a small summary of some notable work in this area. We use some mathematical equations, but omit the proofs. Optimal spam link farms: First we look at some optimal spam-farm structures whose goal is to boost the rank of a set of target pages t, using some boosting pages b, which are under the spammer's control, and hijacked pages h, which are not controlled by the spammer but on which he can place some outlinks. The rank contributed by hijacked pages is the leakage, λ. Single-target spam farm model: Consider a single-target spam farm model [14]. The score of the single target is maximal if:

• ∀ b_i ∈ b: b_i points to t, and to t alone.
• ∀ b_i, b_j ∈ b: there is no link from b_i to b_j.
• t points to some or all b_i ∈ b.
• ∀ h_i ∈ h: h_i points to t.

In fact, it has been shown that leakage has the same effect as boosting pages and need not be treated separately [14]. Alliances of two spam-farms: Consider the case where a group of spammers already have spam-farms and want to mutually boost their rankings by interconnecting them. [14] show that the optimal way to link two spam-farms is to connect the two target pages and remove all links to boosting pages. Web rings and complete cores: [14] present an analysis extending the idea of alliances to more than two spam-farms. The set of target pages is called the core. They find that a ring and a complete sub-graph of target pages both yield a PageRank for the target pages higher than is possible in the optimal unconnected case.

Figure 6: Single-target spam farm (boosting pages p1, ..., pk and leakage λ pointing to target t)

Figure 7: Alliance of 2 spam-farms (targets tp and tq with their boosting pages p1, ..., pk and q1, q2, q3)
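As a sanity check of the single-target conditions above, the following Java sketch (illustrative only; it reuses the WebGraph and PageRank sketches from earlier, whose names are our own) attaches an optimal farm of k boosting pages and one target to the small graph of Figure 1 and reports the target's PageRank.

import java.util.*;

// Minimal sketch: attach an optimal single-target spam farm (k boosting pages that
// point only to the target, target pointing back to them) to an existing edge list,
// then measure the target's PageRank.
public class SpamFarmDemo {
    public static void main(String[] args) {
        int normalPages = 5, k = 3;
        int target = normalPages, firstBooster = normalPages + 1;
        int n = normalPages + 1 + k;

        List<int[]> edges = new ArrayList<>(List.of(
                new int[]{0,1}, new int[]{1,2}, new int[]{2,1},
                new int[]{2,3}, new int[]{3,1}, new int[]{3,4}));    // the Figure 1 graph
        for (int i = 0; i < k; i++) {
            edges.add(new int[]{firstBooster + i, target});          // booster -> target only
            edges.add(new int[]{target, firstBooster + i});          // target -> booster
        }

        double[][] t = WebGraph.transitionMatrix(n, edges.toArray(new int[0][]));
        double[] d = new double[n];
        Arrays.fill(d, 1.0 / n);                                     // unbiased jump vector
        double[] r = PageRank.rank(t, d, 0.85, 20);
        System.out.printf("PageRank of target page %d: %.4f%n", target, r[target]);
    }
}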


[14] also explore budgetary and other considerations for entering and leaving a spam-farm. We verified these claims using Java code; the edge lists and resulting PageRank scores follow.

Edge List: (0,1) (0,3) (1,2) (1,3) (2,3) (2,4) (3,4) (3,5) (4,8) (8,2) (4,1) (5,6) (6,7) (7,5) (5,3) (5,2)

Vertex   PageRank
3        0.180987
2        0.174506
4        0.167751
5        0.150630
1        0.095044
8        0.087961
7        0.067110
6        0.059345
0        0.016667

Edge List: (0,1) (0,3) (1,2) (1,3) (2,3) (2,4) (3,4) (3,5) (4,8) (8,2) (4,1) (5,6) (6,7) (7,5)

Vertex   PageRank
5        0.210012
6        0.195177
7        0.182567
4        0.093834
2        0.091773
3        0.089796
1        0.063629
8        0.056546
0        0.016667

Edge List: (0,1) (0,3) (1,2) (1,3) (2,3) (2,4) (3,4) (3,5) (4,8) (8,2) (4,1) (5,7) (6,7) (7,6)

Vertex   PageRank
7        0.279059
6        0.253867
4        0.093834
2        0.091773
3        0.089796
1        0.063629
8        0.056546
5        0.054830
0        0.016667

Edge List: (0,1) (0,3) (1,2) (1,3) (1,10) (2,3) (2,4) (3,4) (3,5) (4,1) (4,11) (5,7) (6,7) (7,8) (8,7) (9,10) (10,9) (11,9)

Vertex   PageRank
7        0.207752
9        0.199722
10       0.191826
8        0.189090
4        0.037495
3        0.036751
1        0.033748
11       0.028436
5        0.028119
2        0.022062
6        0.012500
0        0.012500

Edge List: (0,1) (0,3) (1,2) (1,3) (1,10) (2,3) (2,4) (3,4) (3,5) (4,1) (4,11) (5,7) (6,7) (7,8) (8,7) (9,10) (10,9) (11,9) (7,9) (9,7)

Vertex   PageRank
7        0.265548
9        0.263454
10       0.134030
8        0.125358
4        0.037495
3        0.036751
1        0.033748
11       0.028436
5        0.028119
2        0.022062
6        0.012500
0        0.012500


VI. Tweaking Ranking Algorithms: a solution to link-spam?

A. Antitrust and Distrust Ranking: [5, 3, 15] have suggested a distrust back-propagation method. Intuitively, just as trust disseminates forward from a set of known good pages, distrust can be imagined to flow out of a seed of known spam pages; however, distrust should propagate backward. The idea is that pages pointing to spam pages are themselves very likely to be spam. The algorithm is analogous to TrustRank. Step 1: Seed: to find seed pages, PageRank can be used. Step 2: G', the transpose of the web graph, is computed. Step 3: The biased PageRank algorithm is applied on G'. Let us apply DistrustRank to the graph in Figure 3, taking L = 2. From Table 2 we know that the seed set {7, 6} will be selected; little surprise that they pass on distrust to 5, which is the only page pointing into them. The resulting scores are shown in Table 4.

Page   DistrustRank
7      0.22
6      0.17
5      0.09
3      0.09
2      0.05
0      0.05
4      0.03
1      0.02
8      0.01

Table 4: Distrust Scores

[3] report that Antitrust Rank algorithms have a higher chance than TrustRank of finding high-PageRank spam pages, as they start with a seed set of spam pages with high PageRank.
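A minimal Java sketch of the three steps above, reusing the PageRank and WebGraph sketches from earlier (names are our own assumptions): plant the known spam pages in d and run the biased iteration on the transposed graph.

// Minimal sketch of DistrustRank: biased PageRank on the transposed graph,
// seeded with known spam pages.
public class DistrustRank {
    public static double[] rank(int n, int[][] edges, int[] spamSeeds, double alpha, int iterations) {
        int[][] reversed = new int[edges.length][2];
        for (int i = 0; i < edges.length; i++) {             // Step 2: transpose the web graph
            reversed[i][0] = edges[i][1];
            reversed[i][1] = edges[i][0];
        }
        double[][] t = WebGraph.transitionMatrix(n, reversed);
        double[] d = new double[n];
        for (int s : spamSeeds) d[s] = 1.0 / spamSeeds.length;   // normalized spam-seed vector
        return PageRank.rank(t, d, alpha, iterations);       // Step 3: biased PageRank on G'
    }
}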

B. Combining Trust and Distrust: How does one combine trust and distrust? One method, which we are exploring, is to use some sort of weighted sum of trust and distrust. We plan to test this on the web dataset crawled by Webaroo, a company based at IIT Bombay. [15] gives a naïve discussion on combining DistrustRank and TrustRank. Webaroo seems to have an interesting hack for combining distrust and trust. [9] present an in-depth abstract framework for trust and distrust propagation; exploring further here might lead to some answers.
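As a placeholder for the weighted-sum idea, a tiny Java sketch; the weight β and the subtractive form are our own assumptions, not a method taken from the cited papers.

// Minimal sketch: combine trust and distrust scores with a weighted sum.
// combined(p) = trust(p) - beta * distrust(p); beta is a tunable assumption.
public class TrustCombiner {
    public static double[] combine(double[] trust, double[] distrust, double beta) {
        double[] combined = new double[trust.length];
        for (int p = 0; p < trust.length; p++)
            combined[p] = trust[p] - beta * distrust[p];
        return combined;
    }
}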

C. Truncated PageRank: Intuitively, if we ignore the direct contribution of the first few levels of links, we get a truer picture of the real rank of a page [16]. Spammers can afford to influence only a few levels, and algorithms should be able to see through them easily. [16] suggest generalizing the PageRank equation to:

r = Σ_{v=0}^{∞} damping(v) · T^v · (1/N) · 1_N



where damping(v) decreases with v. For PageRank,

damping(v) = (1 − α) · α^v

To demote the immediate supporters, the damping function may be redefined as:

damping(v) = (1 − α) · α^v for v > V, and 0 otherwise.

Truncated PageRank is easily obtained by using snapshots of PageRank:

r^(0) = (C/N) · 1_N;   r^(v) = α · T · r^(v−1);   r_trunc = Σ_{v=V+1}^{∞} r^(v)

where r^(v) is the PageRank snapshot at step v and C is a normalization constant. [16] themselves explain that Truncated PageRank is not intended to replace PageRank; rather, it can be used as a feature to classify spam pages, as discussed in Section VII.
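A minimal Java sketch of the snapshot computation above (illustrative only; the truncation level V, the number of levels to accumulate, and the choice C = 1 − α are assumptions).

// Minimal sketch of Truncated PageRank: accumulate PageRank snapshots,
// discarding the contributions of the first V levels of links.
public class TruncatedPageRank {
    public static double[] rank(double[][] t, double alpha, int truncation, int maxLevel) {
        int n = t.length;
        double[] snapshot = new double[n];
        java.util.Arrays.fill(snapshot, (1 - alpha) / n);      // r^(0), with C = 1 - alpha assumed
        double[] truncated = new double[n];
        for (int v = 1; v <= maxLevel; v++) {
            double[] next = new double[n];
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++)
                    next[p] += alpha * t[p][q] * snapshot[q];  // r^(v) = alpha * T * r^(v-1)
            snapshot = next;
            if (v > truncation)                                // keep only levels beyond V
                for (int p = 0; p < n; p++) truncated[p] += snapshot[p];
        }
        return truncated;
    }
}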

D. Topical TrustRank: [17] propose calculating TrustRank scores for different topics, instead of a single TrustRank score, with each score representing the trustworthiness of the site within that particular topic. We believe a combination of these scores will present a better measure of the trustworthiness of a site. The interesting part is how to combine the different TrustRank scores: simple summation or quality bias. Quality bias weights each topic, possibly based on its average PageRank. They also emphasize seed selection, suggesting seed weighting instead of assigning equal weights to seeds in d. Quality bias, unfortunately, is no answer to the problem of combining trust and distrust.

Figure 8: Truncated damping function (damping(v) plotted against v)


VII. Statistics about pages: features to classify link-spam:

Figure 9: Distribution of in-degree and out-degree of web-pages[ 18]

[8] report this distribution of in-degree and out-degree of pages. They also suggest that, in distributions of statistics about pages, the outliers tend to be spam. Different features of pages and link structures can be used to classify spam pages. [18]² use:

• Degree-based measures
• PageRank
• TrustRank
• Truncated PageRank
• Estimation of supporters

They compute these metrics only for the page with maximum PageRank. It would be an interesting exercise to cross-check their findings and come up with similar features that can be tested independently or in combination with these.

² δ is a parameter they use to measure significance for Kolmogorov-Smirnov tests.


Degree-based measures: They find no significant difference in in-degree and out-degree between spam pages and normal pages. They also report that edge-reciprocity shows a marked difference. PageRank:

Figure 10: Left: Histogram of PageRank of normal and spam pages. Right: Histogram of the standard deviation of PageRank of neighbors. [18]

[18] find that most spam home pages fall in a particular narrow PageRank strip. Also, the PageRank of the neighbors of a spam home page shows little dispersion. TrustRank: TrustRank scores also show a marked difference. Combinations of degree correlations, PageRank and TrustRank seem to yield good results [18]. Truncated PageRank:

Truncated PageRank proves particularly useful (see Figure 12)³: spam pages lose a large part of their score when truncated at level V = 4.

³ The T in the figure is V, the threshold.

Figure 11: Histogram of TrustRank of Normal and spam pages.[ 18]

Figure 12: Histogram of Truncated PageRank(V=4)/PageRank.[ 18]
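To illustrate how such statistics could feed a classifier, here is a small Java sketch assembling a per-page feature vector (our own illustration; the particular features and the TruncatedPageRank/PageRank ratio follow the discussion above, not code from [18]).

// Minimal sketch: assemble link-based features for one page, including the
// TruncatedPageRank/PageRank ratio highlighted in Figure 12.
public class SpamFeatures {
    public static double[] features(int page, int inDegree, int outDegree,
                                    double[] pageRank, double[] trustRank, double[] truncatedPageRank) {
        double prRatio = pageRank[page] > 0 ? truncatedPageRank[page] / pageRank[page] : 0.0;
        return new double[] {
            inDegree,
            outDegree,
            pageRank[page],
            trustRank[page],
            prRatio            // low values suggest rank supported mostly by nearby (possibly spam) pages
        };
    }
}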


VIII. Scope for Future work: We believe there is a lot of scope in exploring trust and distrust propagation, and algorithms that combine the two might do well. The analysis of structures that are optimal for spamming ranking algorithms can provide direction for improving those algorithms. Combinations of different statistical features might throw up some nice surprises regarding the differences between the distributions for spam and normal pages. Useful inputs from other fields such as economics and game theory might open up new vistas; some literature on monetary constraints and on the analysis of sponsored search has started to surface.

IX. Conclusion: One may ponder how long we will be able to use links as endorsements, and whether someday search engines will stop using them altogether. Combating web-spam seems to require a combination of term- and link-spam detection techniques. Hopefully, these approaches are not just orthogonal but complementary. The spammers' and search engines' goals will, for some time at least, remain conflicting: creating the adversarial web.


X. References:

1. L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank citation ranking: Bringing order to the Web," 1998.
2. Z. Gyöngyi, H. Garcia-Molina, J. Pedersen, "Combating Web Spam with TrustRank," Proceedings of the 30th VLDB Conference, 2004.
3. V. Krishnan, R. Raj, "Web Spam Detection with Anti-Trust Rank," December 2005.
4. Z. Gyöngyi, H. Garcia-Molina, "Web Spam Taxonomy," First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
5. P. T. Metaxas, J. DeStefano, "Web Spam, Propaganda and Trust."
6. J. Nielsen, "When Search Engines Become Answer Engines," August 2004. http://www.useit.com/alertbox/20040816.html
7. C. Silverstein, M. Henzinger, H. Marais, M. Moricz, "Analysis of a Very Large Web Search Engine Query Log," 1999.
8. D. Fetterly, M. Manasse, M. Najork, "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages," WebDB, 2004.
9. R. Guha, R. Kumar, P. Raghavan, A. Tomkins, "Propagation of Trust and Distrust," 2004.
10. J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," 1999.
11. T. Haveliwala, "Topic-sensitive PageRank," WWW, 2002.
12. M. Bianchini, M. Gori, F. Scarselli, "Inside PageRank," 2005.
13. R. Baeza-Yates, B. Ribeiro-Neto, "Modern Information Retrieval," Addison Wesley, 1999.
14. Z. Gyöngyi, H. Garcia-Molina, "Link Spam Alliances."
15. BadRank. http://pr.efactory.de/e-pr0.shtml
16. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates, "Using rank propagation and probabilistic counting for link-based spam detection," Technical report, 2006.
17. B. Wu, V. Goel, B. D. Davison, "Topical TrustRank: Using topicality to combat web spam," WWW, May 2006.
18. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates, "Link-based characterization and detection of web spam," AIRWeb, 2006.

