Master Thesis Defense December 16, 2010 Database and Multimedia Lab Korea Advanced Institute of Science and Technology (KAIST) Improving the Quality of Web Spam Filtering by Using Seed Refinement Presenter: Qureshi, Muhammad Atif Advisor: Whang, Kyu-Young

Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Aug 31, 2014


M Atif Qureshi

Slides for my Master's thesis defense. The research was conducted under Prof. Kyu-Young Whang and was successfully defended at the KAIST Computer Science Department on 16 December 2010.
Transcript
Page 1: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Master's Thesis Defense

December 16, 2010

Database and Multimedia Lab
Korea Advanced Institute of Science and Technology (KAIST)

Improving the Quality of Web Spam Filtering by Using Seed Refinement

Presenter: Qureshi, Muhammad Atif
Advisor: Whang, Kyu-Young

Page 2: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Contents

Introduction

Related Work

Web Spam Filtering Using Seed Refinement: Algorithms, Strategy

Performance Evaluation

Conclusion

Page 3: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Search Engine

Definition [BP98]: A system that retrieves relevant web pages for users' queries from the World Wide Web (WWW).

Examples: Google, Yahoo!, MS Live Search, Naver.

Page 4: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Page Ranking

Motivation: User queries return a huge number of relevant web pages, but users want to browse the most important ones.

Note: Relevance means that a web page matches the user's query.

Concept: Ordering the relevant web pages according to their importance [GMT04].

Note: Importance represents the user's interest in the relevant web pages.

Methods [ACG01]

Link-based method: exploiting the link structure of the web for ordering the search results.

Content-based method: exploiting the contents of web pages for ordering the search results.

We focus on link-based methods since these methods are prevalent in popular search engines [BP98, CDG07, YUT08] such as Google and Yahoo! [YUT08].

Page 5: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Link Structure of Web [GGP04]

Concept: The web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes.

Note: A web node represents either a web page or a web domain.

Links are classified into two classes as follows:

Inlink: the incoming link to a web node.

Outlink: the outgoing link from a web node.

The link structure is called the web graph.

Example

V = {A, B, C}

E = {AB, BC}

AB is an outlink of the web node A.

BC is an outlink of the web node B.

AB is an inlink of the web node B.

BC is an inlink of the web node C.

Fig. 1: An example of a web graph (A → B → C).
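The example graph can be written down directly as an edge list. The following is a minimal sketch, not part of the original slides; the helper name `build_link_index` is hypothetical and simply groups each directed edge by its source (outlinks) and by its target (inlinks).

```python
from collections import defaultdict

def build_link_index(edges):
    """Return (outlinks, inlinks) dictionaries for a list of directed edges."""
    outlinks = defaultdict(list)  # node -> nodes it points to
    inlinks = defaultdict(list)   # node -> nodes pointing to it
    for src, dst in edges:
        outlinks[src].append(dst)
        inlinks[dst].append(src)
    return outlinks, inlinks

# The web graph of Fig. 1: V = {A, B, C}, E = {AB, BC}
V = {"A", "B", "C"}
E = [("A", "B"), ("B", "C")]
outlinks, inlinks = build_link_index(E)

print("outlinks of A:", outlinks["A"])
print("inlinks of C:", inlinks["C"])
```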

Page 6: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Page Ranking by Using the Link-based Methods

Concept [BP98]: A web node is more important if it receives more inlinks.

Popular method: PageRank [BP98]

$$PR[p] = d \cdot \sum_{q:(q,p) \in E} \frac{PR[q]}{N_{outlink}(q)} + (1 - d) \cdot v[p]$$

PR[p]: PageRank value of the web node p

N_{outlink}(q): the number of outlinks of the web node q

d: damping factor (probability of following an outlink)

v[p]: the probability of random jump from the web node p to any arbitrary web node
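The PageRank definition above can be evaluated by repeated substitution until the scores stop changing. The sketch below is illustrative only: the function name `pagerank`, the uniform choice of v[p], the L1 stopping rule, and the default tolerance are assumptions rather than details from the slides. Nodes without outlinks simply pass on no score, as in the formula.

```python
def pagerank(nodes, edges, d=0.85, eps=1e-10, max_iter=1000):
    """Iterate PR[p] = d * sum(PR[q] / N_outlink(q)) + (1 - d) * v[p]."""
    nodes = sorted(nodes)
    v = {p: 1.0 / len(nodes) for p in nodes}  # assumption: uniform random jump
    out_count = {p: 0 for p in nodes}
    for q, _ in edges:
        out_count[q] += 1

    pr = dict(v)  # start from the random-jump distribution
    for _ in range(max_iter):
        nxt = {p: (1 - d) * v[p] for p in nodes}
        for q, p in edges:
            nxt[p] += d * pr[q] / out_count[q]  # q shares its score over its outlinks
        delta = sum(abs(nxt[p] - pr[p]) for p in nodes)
        pr = nxt
        if delta < eps:
            break
    return pr

# The three-node chain A -> B -> C used in Fig. 1
pr = pagerank({"A", "B", "C"}, [("A", "B"), ("B", "C")])
print(pr)
```

On this chain the node C, which sits at the end of the longest path of inlinks, receives the highest score.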

Page 7: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Spam [HMS02, GG05]

Concept: Any deliberate action taken to boost a web node's rank without improving its real merit.

Link spam: web spam against link-based methods, i.e., an action that changes the link structure of the web in order to boost a web node's ranking.

Example

Fig. 2: An example of link spam. An actor who wants to boost the rank of the web node N3 creates the web nodes N3 to Nx. The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes. The web nodes N3-Nx are involved in link spam, so they are called spam nodes.

Page 8: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Spam Filtering Algorithm

Overview

The web spam filtering algorithms output spam nodes to be filtered out [GBG06].

In order to identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as an input [GGP04, KR06, GBG06, WD05].

Spam input seed set: the input seed set containing spam nodes.

Non-spam input seed set: the input seed set containing non-spam nodes.

The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam nodes [GGP04, KR06, GBG06].

Observation

The output quality of web spam filtering algorithms depends on the quality of the input seed sets.

The output of one web spam filtering algorithm can be used as the input of another web spam filtering algorithm.

The algorithms may support one another if placed in appropriate succession.

Page 9: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Motivation and Goal

Motivation

There is no well-known study which addresses the refinement of the input seed sets for web spam filtering algorithms.

There is no well-known study on successions among web spam filtering algorithms.

Goal

Improving the quality of web spam filtering by using seed refinement.

Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.

Page 10: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Contributions

We propose modified algorithms that apply seed refinement techniques using both spam and non-spam input seed sets to well-known web spam filtering algorithms.

We propose a strategy that makes the best succession of the modified algorithms.

We conduct extensive experiments in order to show the quality improvement of our work.

We compare the original (i.e., well-known) algorithms with the respective modified algorithms.

We evaluate the best succession among our modified algorithms.

Page 11: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Related Work

There are two research directions related to web spam.

1. Evaluating either the goodness or badness of web nodes [GGP04, KR06].

TrustRank and Anti-TrustRank are well-known algorithms. These two algorithms can be used for refining input seed sets.

2. Detecting spam nodes [GBG06, WD05].

Spam Mass and Link Farm Spam are well-known algorithms. These two algorithms can be used for identifying web spam.

We classify web spam filtering algorithms into two types of algorithms:

Seed refinement algorithms (e.g., TrustRank and Anti-TrustRank).

Spam detection algorithms (e.g., Spam Mass and Link Farm Spam).

Note: Existing work exploits a web graph whose web nodes represent domains [GBG06, WD05].

Page 12: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

TrustRank

Overview [GGP04]

Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains by using outlinks.

Trust scores are propagated through the outlinks of trusted domains.

Domains having high trust scores (≥ threshold) at the end of propagation are declared as non-spam domains.

Observation: Trust scores can propagate to spam domains if a trusted domain has outlinks to the spam domains.

Example

Fig. 3: An example for explaining TrustRank. Legend: a seed non-spam domain; a domain being considered; t(i) is the trust score of domain i. Scores shown: t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3. The edge labels are 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, and 5/12. The domain 3 gets trust scores from the domains 1 and 2.
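The numbers in Fig. 3 are consistent with each domain splitting its trust score evenly over its outlinks. The short check below is a sketch under that reading; it assumes that one seed has two outlinks and the other has three, that domain 3 has two outlinks, and that no damping is applied in the figure, none of which is stated explicitly on the slide.

```python
from fractions import Fraction

# Seed non-spam domains start with trust score 1 (as in Fig. 3).
t1 = Fraction(1)
t2 = Fraction(1)

# Assumption: one seed splits its score over 2 outlinks, the other over 3.
share_from_1 = t1 / 2
share_from_2 = t2 / 3

t3 = share_from_1 + share_from_2  # domain 3 is reached from both seeds
t4 = share_from_2                 # domain 4 is reached from one seed only
per_outlink_of_3 = t3 / 2         # domain 3 splits its score over 2 outlinks

print(t3, t4, per_outlink_of_3)
```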

Page 13: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Anti-TrustRank

Overview [KR06]

Anti-trusted domains (e.g., well-known spam domains) are usually pointed to by spam domains, i.e., their inlinks usually come from spam domains.

Anti-trust scores are propagated through the inlinks of anti-trusted domains.

Domains having high anti-trust scores (≥ threshold) at the end of propagation are declared as spam domains.

Observation: Anti-trust scores can propagate to non-spam domains if a non-spam domain has outlinks to a spam domain.

Example

Fig. 4: An example for explaining Anti-TrustRank. Legend: a seed spam domain; a domain being considered; at(i) is the anti-trust score of domain i. Scores shown: at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3. The edge labels are 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, and 5/12. The domain 3 gets anti-trust scores from the domains 1 and 2.

Page 14: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Spam Mass

Overview [GBG06]

A domain is spam if it has an excessively high spam score.

The spam score is estimated by subtracting the non-spam score from the PageRank score.

The non-spam score is estimated as the trust score computed by TrustRank.

Observation: Since Spam Mass uses TrustRank, it inherently has the same problem as TrustRank does.

Example

Fig. 5: An example for explaining Spam Mass (domains 1 to 7). Legend: a seed non-spam domain; a domain being considered. The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.

Page 15: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Link Farm Spam

Overview [WD05]

A domain is spam if it has many bidirectional links with other domains.

A domain is spam if it has many outlinks pointing to spam domains.

Observation

Link Farm Spam does not take any input seed set.

A domain can have many bidirectional links with trusted domains as well.

Example

Fig. 6: An example for explaining Link Farm Spam (domains 1 to 5). Legend: a domain being considered. The domains 1, 3, and 4 have bidirectional links.

Page 16: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Web Spam Filtering Using Seed Refinement

Objectives

Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives).

Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives).

Our approaches

We modify the spam filtering algorithms by using both spam and non-spam domains in order to decrease False Positives.

We use non-spam domains so that their goodness does not propagate to spam domains.

We use spam domains so that their badness does not propagate to non-spam domains.

We make the succession of these algorithms in order to increase True Positives.

We make the succession of the seed refinement algorithm followed by the spam detection algorithm, so that the spam detection algorithm uses the refined input seed sets, which are produced by the seed refinement algorithm.

Page 17: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Modified TrustRank

Modification: Trust score should not propagate to spam domains.

Example

Fig. 7: An example explaining Modified TrustRank. Legend: a seed non-spam domain; a seed spam domain; a domain being considered; t(i) is the trust score of domain i. Scores shown: t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …. The edge labels are 1/2, 1/3, and 5/12. The domains 5 and 6 are involved in Web spam.

Page 18: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Modified Anti-TrustRank

Modification: Anti-trust score should not propagate to non-spam domains.

Example

Fig. 8: An example explaining Modified Anti-TrustRank. Legend: a seed spam domain; a seed non-spam domain; a domain being considered; at(i) is the anti-trust score of domain i. Scores shown: at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …. The edge labels are 1/2, 1/3, and 5/12. The domains 5, 6, and 7 are non-spam domains.

Page 19: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Modified Spam Mass

Modification: Use Modified TrustRank in place of TrustRank.

Example

Fig. 9: An example explaining Modified Spam Mass (domains 1 to 7). Legend: a seed non-spam domain; a seed spam domain; a domain being considered. The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.

Page 20: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Modified Link Farm Spam

Modification

Use two types (i.e., spam and non-spam domains) of input seed sets.

A domain having many bidirectional links with only trusted domains is not detected as a spam domain.

Example

Fig. 10: An example explaining Modified Link Farm Spam (domains 1 to 8). Legend: a seed non-spam domain; a domain being considered. The domains 1, 3, and 4 have bidirectional links.

Page 21: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Strategy to Make Succession of Modified Algorithms

Overview

We make the succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector).

We also consider the execution order of algorithms belonging to Seed Refiner and Spam Detector, respectively.

Fig. 11: The strategy of succession. Data flow: manually labeled spam and non-spam domains → Seed Refiner → refined spam and non-spam domains → Spam Detector → detected spam domains.

Strategy

Consideration of the execution order in Seed Refiner:

Modified TrustRank followed by Modified Anti-TrustRank.

Modified Anti-TrustRank followed by Modified TrustRank.

Consideration of the execution order in Spam Detector:

Modified Spam Mass followed by Modified Link Farm Spam.

Modified Link Farm Spam followed by Modified Spam Mass.
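The succession strategy can be pictured as function composition: seed refiners transform the pair of seed sets, and spam detectors then extend the spam set. The sketch below is a hypothetical illustration; the function names and the call signatures (a refiner returns both seed sets, a detector returns the updated spam set) are assumptions, not an interface defined in the slides.

```python
from itertools import permutations

def run_succession(graph, spam, nonspam, refiners, detectors):
    """Run the Seed Refiner stage, then the Spam Detector stage, in the given order."""
    for refine in refiners:
        # Assumed signature: refine(graph, spam, nonspam) -> (spam, nonspam)
        spam, nonspam = refine(graph, spam, nonspam)
    for detect in detectors:
        # Assumed signature: detect(graph, spam, nonspam) -> spam
        spam = detect(graph, spam, nonspam)
    return spam

def candidate_orders(refiners, detectors):
    """Enumerate every execution order within each stage."""
    return [(r, d)
            for r in permutations(refiners)
            for d in permutations(detectors)]
```

With two seed refinement algorithms and two spam detection algorithms, `candidate_orders` yields the four combinations of orderings considered on this slide.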

Page 22: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Performance Evaluation

Purpose

Show the effect of seed refinement on the quality of web spam filtering.

Show the effect of succession on the quality of web spam filtering.

Experiments

We conduct two sets of experiments according to the two purposes mentioned above.

Table 1: Summary of the experiments.

Set 1: Comparisons for showing the effect of refining seed

Exp. 1: Comparison between TR (TrustRank) and MTR (Modified TrustRank). Parameters: cutoffTr 0% − 300%; ratioTop 10%, 50%, 100%; damp 0.85.

Exp. 2: Comparison between ATR (Anti-TrustRank) and MATR (Modified Anti-TrustRank). Parameters: cutoffATr 0% − 300%; ratioTop 10%, 50%, 100%; damp 0.85.

Exp. 3: Comparison between SM (Spam Mass) and MSM (Modified Spam Mass). Parameters: relativeMass 0.7 − 1.0; topPR 10%, 50%, 100%; damp 0.85.

Exp. 4: Comparison between LFS (Link Farm Spam) and MLFS (Modified Link Farm Spam). Parameters: limitBL 2 − 7; limitOL 2 − 7.

Set 2: Comparisons for showing the effect of ordering executions

Exp. 5: Finding the best succession for the seed refiner. Parameters: cutoffTr 50%, 75%, 100%; cutoffATr 100%; damp 0.85.

Exp. 6: Finding the best succession for the spam detector. Parameters: relativeMass 0.8 − 0.99; topPR 100%; limitBL 7; limitOL 7; damp 0.85.

Exp. 7: Comparison among the best succession, the best known algorithm, and the best modified algorithm. Parameters: relativeMass 0.8 − 0.99; topPR 100%; limitBL 7; limitOL 7; damp 0.85.

Page 23: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Experimental Parameters

Table 2: Parameters used in experiments.

damp: A parameter used in TR, MTR, ATR, and MATR. It is the probability of following an outlink.

ratioTop: The ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the Spam (or Non-Spam) Seed Set, we retrieve the domains whose PageRank scores are larger than or equal to the PageRank score of the top-ratioTop% domain among all domains, and then use these domains as the input seed set.

cutoffTr: The cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, we set the value of cutoffTr proportional to the size of the input seed set of non-spam domains.

cutoffATr: The cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, we set the value of cutoffATr proportional to the size of the input seed set of spam domains.

relativeMass: A threshold used in SM and MSM for deciding whether a domain is spam: if the domain receives an excessively higher spam score compared to its non-spam score, the domain is one of the candidates for Web spam.

topPR: A threshold used in SM and MSM for deciding whether a domain is a spam candidate by checking that the PageRank score of the domain is within the top percentage (i.e., topPR %) of the PageRank scores.

limitBL: A threshold used in LFS and MLFS for declaring a domain as spam if the number of bidirectional links of the domain is equal to or greater than this threshold.

limitOL: A threshold used in LFS and MLFS for declaring a domain as spam if the number of outlinks of the domain pointing to spam domains is equal to or greater than this threshold.

Page 24: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Experimental Data

Experimental data [BCD08] [CDB06] [CDG07]

Table 3: Characteristics of the data set in terms of domains and web pages.

Domains, labeled spam: 1,924

Domains, labeled non-spam: 5,549

Domains, unlabeled (unknown): 3,929

Domains, total: 11,402

Web pages, total: 77.9 million

Table 4: Classification of the data set as Seed Set and Test Set.

Labeled spam domains: Seed Set 674; Test Set 1,250

Labeled non-spam domains: Seed Set 4,948; Test Set 601

Page 25: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Experimental Measure

Table 5: Description of the measures.

True positives: The number of domains correctly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]

False positives: The number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]

F-measure: The combined representation of precision and recall. Precision, recall [SM86], and F-measure are expressed as follows.

$$precision = \frac{true\ positives}{true\ positives + false\ positives}$$

$$recall = \frac{true\ positives}{true\ positives + false\ negatives}$$

$$F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}$$

Note: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
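The three measures translate directly into code. This is a minimal sketch; the function names are chosen here for illustration, and the counts in the usage lines are made-up numbers, not results from the thesis.

```python
def precision(true_positives, false_positives):
    """Fraction of domains assigned to the class that really belong to it."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Fraction of domains of the class that were actually found."""
    return true_positives / (true_positives + false_negatives)

def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Hypothetical counts, for illustration only
p = precision(true_positives=80, false_positives=20)
r = recall(true_positives=80, false_negatives=20)
print(p, r, f_measure(p, r))
```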

Page 26: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Comparison between Original and Modified Algorithms (1/3)

Experiment 1: Comparison between TR and MTR

MTR performs either comparably to or slightly better than TR in terms of both true positives and false positives.

We find cutoffTr effective up to the 100% mark, indicating that after 100% detection becomes unstable in terms of false positives.

For later experiments, we limit the cutoffTr range to at most 100%.

Experiment 2: Comparison between ATR and MATR

MATR generally performs better than ATR in terms of true positives.

We find cutoffATr effective up to the 180% mark, indicating that after 100% detection becomes unstable in terms of false positives.

For later experiments, we fix cutoffATr at 100% to ensure high precision.

Page 27: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Comparison between Original and Modified Algorithms (2/3)

Experiment 3: Comparison between SM and MSM

MSM performs slightly better than SM in terms of true positives and comparably in terms of false positives.

We find relativeMass effective in the range of 0.95 to 0.99 in terms of maximizing true positives and minimizing false positives.

For later experiments, we keep the range from 0.8 to 0.99 as the effective range of relativeMass.

Experiment 4: Comparison between LFS and MLFS

MLFS performs better than LFS in terms of false positives, at some expense of true positives.

We find limitBL and limitOL highly effective at 7 and 7, respectively, in terms of avoiding many false positives.

For later experiments, we keep limitBL = 7 and limitOL = 7.

Page 28: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Comparison between Original and Modified Algorithms (3/3)

Summary

We have found that all modified algorithms provide better quality than the respective original algorithms.

We found SM to be the best original web spam detection algorithm among ATR, SM, and LFS, due to high true positives and relatively few false positives.

We also found MSM to be the best modified web spam detection algorithm among MATR, MSM, and MLFS, due to high true positives and relatively few false positives.

Page 29: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

The Best Succession for the Seed Refiner

Table 6: Comparison for the seed refiner.

For finding refined non-spam domains: True Positives: identical performance for both successions. False Positives: identical performance for both successions.

For finding refined spam domains: True Positives: identical performance for both successions. False Positives: better performance for MATR-MTR compared to MTR-MATR.

Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.

Page 30: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

The Best Succession for the Spam Detector

Comparison

We pick 0.99 for relativeMass since false positives are at their minimum at this value compared to other values of relativeMass, while true positives are almost comparable for all values of relativeMass.

We observe that MLFS fails to detect a considerable number of spam domains.

We obtain the precisions 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively.

We obtain the recalls 0.80, 0.80, 0.33, and 0.76 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively.

MLFS-MSM and MSM-MLFS are the best and identical in performance, so we choose MLFS-MSM as the best spam detector without loss of generality.

Fig. 12: Comparison for the spam detector.

Page 31: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm

Comparison

We pick 0.99 for relativeMass since false positives are at their minimum at this value compared to other values of relativeMass, while true positives are almost comparable for all values of relativeMass.

We observe that MATR-MTR-MLFS-MSM finds more true positives and some more false positives.

We obtain the precisions 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively.

We obtain the recalls 0.64, 0.70, and 0.80 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively.

Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM.

Therefore, MATR-MTR-MLFS-MSM is more effective.

Page 32: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Conclusions

We have improved the quality of web spam filtering by using seed refinement.

We have proposed modifications to four well-known web spam filtering algorithms.

We have proposed a strategy of succession of modified algorithms.

Seed Refiner contains the order of executions for seed refinement algorithms.

Spam Detector contains the order of executions for spam detection algorithms.

We have conducted extensive experiments in order to show the effect of seed refinement on the quality of web spam filtering.

We find that every modified algorithm performs better than the respective original algorithm.

We find the best performance among the successions with MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM).

This succession outperforms the best original algorithm, i.e., SM, by up to 1.25 times in recall and is comparable in terms of precision.
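The 1.25 figure follows from the recalls reported for Fig. 13 (0.80 for the succession against 0.64 for SM). The sketch below redoes that arithmetic; the F-measure values it prints are derived here from the reported precision and recall and are not numbers stated in the slides.

```python
# (precision, recall) pairs as reported for Fig. 13
reported = {
    "SM": (0.85, 0.64),
    "MSM": (0.86, 0.70),
    "MATR-MTR-MLFS-MSM": (0.86, 0.80),
}

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

recall_gain = reported["MATR-MTR-MLFS-MSM"][1] / reported["SM"][1]
f_scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}

print(f"recall gain over SM: {recall_gain:.2f}x")
for name, score in f_scores.items():
    print(f"{name}: derived F-measure {score:.3f}")
```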

Page 33: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

References (1/2)

[ACG01] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., “Searching the Web,” ACM Transactions on Internet Technology (TOIT), Vol. 1, No. 1, pp. 2-43, Aug. 2001.

[BP98] Brin, S. and Page, L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” In Proc. 7th Int'l Conf. on World Wide Web (WWW), pp. 107-117, Brisbane, Australia, Apr. 1998.

[BCD08] Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., and Leonardi, S., “Link Analysis for Web Spam Detection,” ACM Transactions on Web (TWEB), Vol. 2, No. 1, pp. 1-42, Mar. 2008.

[CDB06] Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S., “A Reference Collection for Web Spam,” SIGIR Forum, Vol. 40, No. 2, pp. 11-24, Dec. 2006.

[CDG07] Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F., “Know Your Neighbors: Web Spam Detection Using the Web Topology,” In Proc. 30th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 423-430, Amsterdam, The Netherlands, July 2007.

[GG05] Gyongyi, Z. and Garcia-Molina, H., “Web Spam Taxonomy,” In Proc. 1st Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 39-47, Chiba, Japan, May 2005.

[GBG06] Gyongyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J., “Link Spam Detection Based on Mass Estimation,” In Proc. 32nd Int'l Conf. on Very Large Data Bases (VLDB), pp. 439-450, Seoul, Korea, Sept. 2006.

[GGP04] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J., “Combating Web Spam with TrustRank,” In Proc. 30th Int'l Conf. on Very Large Data Bases (VLDB), pp. 576-587, Toronto, Canada, Aug. 2004.

Page 34: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

References (2/2)

[KR06] Krishnan, V. and Raj, R., “Web Spam Detection with Anti-TrustRank,” In Proc. 2nd Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 37-40, Washington, USA, Aug. 2006.

[WD05] Wu, B. and Davison, B., “Identifying Link Farm Spam Pages,” In Proc. Special Interest Tracks and Posters of the 14th Int'l Conf. on World Wide Web (WWW), pp. 820-829, Chiba, Japan, May 2005.

[SM86] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1986.

[YUT08] Yoshida, Y., Ueda, T., Tashiro, T., Hirate, Y., and Yamana, H., “What's Going on in Search Engine Rankings,” In Proc. 22nd Int'l Conf. on Advanced Information Networking and Applications (AINAW), pp. 1199-1204, Okinawa, Japan, Mar. 2008.

Page 35: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

THANK YOU VERY MUCH!

Page 36: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

MTR Algorithm

Input:
    A seed set of non-spam domains N
    A seed set of spam domains S
    The threshold cutoff
    The difference threshold ε
    Web graph G = (V, E)

Output:
    A set of non-spam domains ON
    Trust score vector of all domains

Algorithm:
    for each d ∈ V
        if d ∈ N then
            T0[d] = 1 / size(N)
        else
            T0[d] = 0
    i = 0
    do
        for each d ∈ V
            for each (d, q) ∈ E
                if q ∉ S then
                    Ti+1[q] = Ti+1[q] + damp · Ti[d] / Noutlink(d)
        for each d ∈ V
            Ti+1[d] = Ti+1[d] + (1 − damp) · Ti[d]
        Δ = | Ti+1 − Ti |
        i = i + 1
    until Δ < ε
    Tordered = Order Ti+1 by trust scores in descending order
    ON = Highest trust score domains within cutoff
    return ON, Tordered
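The MTR pseudocode can be turned into a short runnable sketch. Several details below are assumptions and not taken from the slides: the function name, `cutoff` expressed as a number of domains, the L1 norm for Δ, and the `max_iter` guard. Most importantly, the random-jump term here uses the initial seed distribution T0[d], as in the original TrustRank formulation [GGP04], whereas the slide writes that term with Ti[d].

```python
def modified_trustrank(nodes, edges, nonspam_seeds, spam_seeds,
                       cutoff, damp=0.85, eps=1e-8, max_iter=1000):
    """Propagate trust from non-spam seeds, never into known spam seeds."""
    out_count = {d: 0 for d in nodes}
    for d, _ in edges:
        out_count[d] += 1

    # Trust starts uniformly distributed over the non-spam seed set N.
    t0 = {d: (1.0 / len(nonspam_seeds) if d in nonspam_seeds else 0.0)
          for d in nodes}
    t = dict(t0)
    for _ in range(max_iter):
        nxt = {d: 0.0 for d in nodes}
        for d, q in edges:
            if q not in spam_seeds:  # the modification: skip seed spam domains
                nxt[q] += damp * t[d] / out_count[d]
        for d in nodes:
            nxt[d] += (1 - damp) * t0[d]  # assumption: jump back to the seeds
        delta = sum(abs(nxt[d] - t[d]) for d in nodes)
        t = nxt
        if delta < eps:
            break

    ordered = sorted(nodes, key=lambda d: t[d], reverse=True)
    return set(ordered[:cutoff]), t

# Tiny illustrative graph: trusted "g" links to "a" and to the seed spam "s".
on, trust = modified_trustrank(
    nodes={"g", "a", "s"}, edges=[("g", "a"), ("g", "s")],
    nonspam_seeds={"g"}, spam_seeds={"s"}, cutoff=2)
print(on, trust)
```

In the toy graph the seed spam domain "s" keeps a trust score of zero even though a trusted domain links to it, which is the effect the modification is meant to achieve.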

Page 37: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

MATR Algorithm

Input:
    A seed set of non-spam domains N
    A seed set of spam domains S
    The threshold cutoff
    The difference threshold ε
    Web graph G = (V, E)

Output:
    A set of spam domains OS
    Anti-trust score vector of all domains

Algorithm:
    for each d ∈ V
        if d ∈ S then
            AT0[d] = 1 / size(S)
        else
            AT0[d] = 0
    i = 0
    do
        for each d ∈ V
            for each (q, d) ∈ E
                if q ∉ N then
                    ATi+1[q] = ATi+1[q] + damp · ATi[d] / Ninlink(d)
        for each d ∈ V
            ATi+1[d] = ATi+1[d] + (1 − damp) · ATi[d]
        Δ = | ATi+1 − ATi |
        i = i + 1
    until Δ < ε
    ATordered = Order ATi+1 by anti-trust scores in descending order
    OS = Highest anti-trust score domains within cutoff
    return OS, ATordered

Page 38: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

MSM Algorithm

Input:
    A seed set of non-spam domains N
    A seed set of spam domains S
    The threshold topPR
    The threshold relativeMass
    The difference threshold ε
    Web graph G = (V, E)

Output:
    A set of spam domains OS

Algorithm:
    ON, T = Modified TrustRank(N, S, cutoff, ε, G)
    P = PageRank(ε, G)
    for each d ∈ V
        if P[d] ≥ topPR then
            if (P[d] − T[d]) / P[d] ≥ relativeMass then
                OS ← OS ⋃ {d}
    return OS
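The decision rule of MSM can be sketched once PageRank and trust scores are available. In the sketch below the two score dictionaries are passed in directly, `top_pr` is interpreted as a fraction of the domains ranked by PageRank, and the two scores are assumed to be on comparable scales; these choices, the function name, and the example numbers are all illustrative assumptions.

```python
def modified_spam_mass(pagerank, trust, top_pr=1.0, relative_mass=0.99):
    """Flag domains whose PageRank is mostly unexplained by trust."""
    ranked = sorted(pagerank.values(), reverse=True)
    k = max(1, int(len(ranked) * top_pr))
    pr_threshold = ranked[k - 1]  # lowest PageRank still inside the top fraction

    spam = set()
    for d, p in pagerank.items():
        if p >= pr_threshold and p > 0:
            # Relative spam mass: share of P[d] not covered by the trust score T[d]
            if (p - trust.get(d, 0.0)) / p >= relative_mass:
                spam.add(d)
    return spam

# Made-up scores, for illustration only
P = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
T = {"a": 0.39, "b": 0.0, "c": 0.001, "d": 0.0}
print(modified_spam_mass(P, T, top_pr=0.5))
print(modified_spam_mass(P, T, top_pr=1.0))
```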

Page 39: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

MLFS Algorithm

Input:
    A seed set of non-spam domains N
    A seed set of spam domains S
    The threshold limitBL
    The threshold limitOL
    Web graph G = (V, E)

Output:
    A set of spam domains OS

Algorithm:
    OS ← S
    for each d ∈ V
        if d ∉ N then
            I = inDomain(d) – N – {d}
            O = outDomain(d) – N – {d}
            if size(I ∩ O) ≥ limitBL
                OS ← OS ⋃ {d}
    do
        Oold ← OS
        for each d ∈ V
            if d ∉ N then
                O = outDomain(d) ∩ OS
                if size(O) ≥ limitOL
                    OS ← OS ⋃ {d}
    until size(OS) = size(Oold)
    return OS
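The two phases of MLFS (bidirectional-link seeding, then expansion along outlinks into the spam set) can be sketched as follows. The function name and the toy graph are illustrative assumptions, and the expansion loop here repeats until the spam set stops growing, which is the reading of the termination condition adopted in this sketch.

```python
def modified_link_farm_spam(nodes, edges, nonspam_seeds, spam_seeds,
                            limit_bl, limit_ol):
    """Detect link-farm spam while never flagging or counting trusted seeds."""
    out_dom = {d: set() for d in nodes}
    in_dom = {d: set() for d in nodes}
    for src, dst in edges:
        out_dom[src].add(dst)
        in_dom[dst].add(src)

    spam = set(spam_seeds)
    # Phase 1: many bidirectional links with domains outside the trusted set.
    for d in nodes:
        if d not in nonspam_seeds:
            bidirectional = (in_dom[d] & out_dom[d]) - nonspam_seeds - {d}
            if len(bidirectional) >= limit_bl:
                spam.add(d)

    # Phase 2: many outlinks into the current spam set; repeat until stable.
    changed = True
    while changed:
        changed = False
        for d in nodes:
            if d not in nonspam_seeds and d not in spam:
                if len(out_dom[d] & spam) >= limit_ol:
                    spam.add(d)
                    changed = True
    return spam

# Toy graph: "x" exchanges links with "a", "b", and the trusted "t";
# "y" points to "x" and to the seed spam domain "s".
nodes = {"x", "a", "b", "t", "y", "s"}
edges = [("x", "a"), ("a", "x"), ("x", "b"), ("b", "x"),
         ("x", "t"), ("t", "x"), ("y", "x"), ("y", "s")]
detected = modified_link_farm_spam(nodes, edges, {"t"}, {"s"}, 2, 2)
print(detected)
```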

Page 40: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

TR vs. MTR

[Figure: panels (a)-(f) for RatioTop = 10%, 50%, and 100%.]

Page 41: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

ATR vs. MATR

[Figure: panels (a)-(f) for RatioTop = 10%, 50%, and 100%.]

Page 42: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

SM vs. MSM

[Figure: panels (a)-(f) for topPR = 70%, 85%, and 100%.]

Page 43: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

LFS vs. MLFS

[Figure: panels (a) and (b).]

Page 44: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

The Best Succession for the Spam Detector

MSM performs better than the rest due to the minimization of False Positives, while being almost comparable to the best in terms of True Positives.

Fig. x: Comparison for the spam detector.

The winner is MSM for Spam Detector.

Page 45: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm

MATR-MTR-MSM performs better than both SM and MSM. MATR-MTR-MSM finds more True Positives than these two algorithms with comparable False Positives.

MATR-MTR-MSM is very effective compared to the best known algorithm.

Page 46: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

Possible Combinations for Seed Refinement Module

Succession 1 (MATR-MTR), data flow inside the Seed Refiner: manual spam and non-spam seed domains → MATR → manual non-spam domains and refined spam domains → MTR → refined spam and non-spam seed domains.

Succession 2 (MTR-MATR), data flow inside the Seed Refiner: manual spam and non-spam seed domains → MTR → manual spam domains and refined non-spam domains → MATR → refined spam and non-spam seed domains.

Page 47: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

Possible Combinations for Spam Detection Module

Combinations: MLFS-MSM, MSM-MLFS

Single Algorithm: MLFS, MSM

Succession 1 (MLFS-MSM), data flow inside the Spam Detector: refined spam/non-spam seed domains → MLFS → spam domains and refined non-spam domains → MSM → detected spam domains.

Succession 2 (MSM-MLFS), data flow inside the Spam Detector: refined spam/non-spam seed domains → MSM → spam domains and refined non-spam domains → MLFS → detected spam domains.

Page 48: Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

Supplement

TR and ATR problem

[Figure, TrustRank example: a seed non-spam domain; a domain being considered; t(i) is the trust score of domain i. Scores shown: t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …. The domains 5 and 6 are involved in Web spam.]

[Figure, Anti-TrustRank example: a seed spam domain; a domain being considered; at(i) is the anti-trust score of domain i. Scores shown: at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …. The domains 5, 6, and 7 are non-spam domains.]