LSDS-IR’08 www.ir.iit.edu
Cost-Effective Spam Detection in P2P File-Sharing Systems
Dongmei Jia
Information Retrieval Lab, Illinois Institute of Technology
[email protected]
Dec 18, 2015
Goal
• Create cost-effective ways of automatically detecting spam in P2P search results without downloading the actual files
Introduction
• Spam
  – Any file that is misrepresented deliberately or in a way that manipulates established retrieval and ranking techniques
• Spam is harmful
  – Degrades the user search experience
  – Assists the propagation of viruses in the network
  – Has a significant impact on P2P traffic load
Problem Statement
• Naïve spam detection method
  – Download files and check them manually
  – Cons:
    • Time- and labor-consuming
    • Wastes bandwidth and storage resources
    • Risks opening malware
• Hence, automatic spam detection is needed!
eMule Example
[Screenshot of an eMule result list, annotated with the query (number of results), descriptors, group size, and file key]
Hard to detect spam automatically in a query result set!
Types of Spam
• Type 1: Files whose replicas have semantically different descriptors
  – E.g., different song titles for the same key
26NZUBS655CC66COLKMWHUVJGUXRPVUF:
“12 days after christmas.mp3”
“i want you thalia.mp3”
“comon be my girl.mp3”
…
Types of Spam (Cont’d)
• Type 2: Files with long descriptors that contain semantically nonsensical term combinations
  – Single-descriptor problem
  – E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4: “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”
Types of Spam (Cont’d)
• Type 3: Files with descriptors that contain no query terms
  – Ads or warnings about the illegal distribution of copyrighted materials
  – E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
Types of Spam (Cont’d)
• Type 4: Files that are highly replicated on a single peer
  – Normal users do not create multiple replicas of the same file on a single server
  – Manipulates the “group size” ranking
  – E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer
Feature-Based Spam Detection
• Basic idea
  – Detect spam results using P2P features that are strongly correlated with spam
    • Vocabulary size of a file’s group descriptor
    • Variance of terms between a replica descriptor D of a file and its group descriptor G
      – Jaccard distance: 1 − |D ∩ G| / |D ∪ G|
      – Cosine distance: 1 − (VG·VD) / (|VG| |VD|)
    • Per-host replication degree of a file
      – numRep / numHost
    • …
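The two descriptor-distance features above can be sketched as follows. This is a minimal bag-of-words illustration; the function names are illustrative, not from the paper:

```python
from collections import Counter
from math import sqrt

def jaccard_distance(desc, group):
    """1 - |D ∩ G| / |D ∪ G| over the descriptors' term sets."""
    d, g = set(desc.lower().split()), set(group.lower().split())
    return 1.0 - len(d & g) / len(d | g)

def cosine_distance(desc, group):
    """1 - (VG·VD) / (|VG| |VD|) over term-frequency vectors."""
    vd, vg = Counter(desc.lower().split()), Counter(group.lower().split())
    dot = sum(vd[t] * vg[t] for t in vd)
    norm = sqrt(sum(c * c for c in vd.values())) * sqrt(sum(c * c for c in vg.values()))
    return 1.0 - dot / norm

# A replica whose descriptor diverges from the pooled group descriptor gets a
# larger distance, which correlates with spam (e.g., Type 1 above).
```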
Probe Query
• Problem
  – Results have insufficient and biased description info
    • Conjunctive query matching
• Solution
  – Gather more info for a result from the network
    • Other replica descriptors of the file
    • Statistics of the peers who share the file
      – Num of files, num of unique files, peer ID
  – Implementation
    • A probe query contains only a file key, not a “term” query
  – Intuition
    • Probing helps create a more complete view of a file
    • Ranking is more effective with more adequate file info
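As a rough illustration of what a key-only probe gathers, the sketch below collects replica descriptors and peer statistics for one file key. All names and the in-memory peer model are assumptions for illustration, not the paper's protocol:

```python
def probe(file_key, peers):
    """Collect all replica descriptors and peer statistics for one file key."""
    descriptors, peer_stats = [], []
    for peer in peers:
        # Replicas of the probed file shared by this peer
        matches = [f["descriptor"] for f in peer["files"] if f["key"] == file_key]
        if matches:
            descriptors.extend(matches)
            # Peer-level statistics used as detection features
            peer_stats.append({
                "peer_id": peer["id"],
                "num_files": len(peer["files"]),
                "num_unique": len({f["key"] for f in peer["files"]}),
            })
    return descriptors, peer_stats
```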
Evaluation
• Dataset
  – P2P audio files crawled from the Gnutella network:
    • numRep = 25,137,217; numFile = 9,575,113; numPeer = 226,786
  – 50 most popular queries in the crawled dataset
    • Representative of most users, and a more likely target for spam
• Metric
  – Num of spam results in the top-N ranked results, esp. for a small N
• Effectiveness
  – Improves performance by 9% for top-200 results and by 92.5% for top-20 results
    • Base case: noprobe+numRep
Cost Control
• Tradeoff
  – Performance vs. cost
• Cost
  – Num of responses for the regular query and probe queries
• Problem
  – Network cost is dramatically increased by probing
• How to reduce the cost?
Cost Control Approaches
• Random sampling of probe query results
• Piggy-backing of descriptor data in probe queries
• Limiting the scope of probing
Random Sampling
• Server-side random sampling of probe query results
  – A predefined probability P, 0 ≤ P ≤ 1
  – Reduces cost predictably, by a factor of P
  – Impact on the effectiveness of spam detection?
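A minimal sketch of the server-side sampling step, assuming each server flips an independent coin with probability P per probe response (the function name is illustrative):

```python
import random

def sample_probe_responses(responses, p, rng=random):
    """Return each probe response independently with probability p,
    so the expected number of responses (and cost) shrinks by a factor p."""
    return [r for r in responses if rng.random() < p]
```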
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Cost is reduced significantly by sampling fewer probe results
• In all sampling cases, overall performance is still 1.7%–9% better than noprobe
• But the cost is still high: with 25% sampling, cost is ~7 times higher than noprobe
• Performance for top-20 results is 71%–92% better than noprobe
Piggy-backing of Descriptor Data
• Piggy-backing of descriptor data in probe queries
  – New type of probe query
    • File key + descriptor of the result file being probed
  – A server does not respond if its descriptor contains no new terms compared with the descriptor in the probe query
    • Limits the num of probe results returned to the client
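The server-side filter can be sketched as follows; a minimal illustration in which the function name and the bag-of-words comparison are my assumptions, not the paper's exact protocol:

```python
def should_respond(server_descriptor, probed_descriptor):
    """A server answers a piggy-backed probe only when its local descriptor
    adds at least one term the client has not already seen."""
    server_terms = set(server_descriptor.lower().split())
    known_terms = set(probed_descriptor.lower().split())
    return bool(server_terms - known_terms)
```

Suppressing redundant responses this way trades a little descriptor coverage for a large cut in probe traffic.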
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Compared with the original type of probe, total cost is decreased by 35%–39% for all sampling rates
• Compared with the original type of probe, overall performance drops by ~15%
• E.g., the cost with sampling rate 0.25 is ~4 times higher than noprobe
• However, performance for top-20 results is improved by 71%–88% in all sampling cases
Limiting Probing Scope
• Limiting the scope of probing
  – Only probe a few top-ranked (e.g., top-20) regular query results
  – Intuition
    • Users tend to consider downloading a file only from a few top-ranked results
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Performance of probing only the top-20 results is always 22%–56% better than noprobe
• Probing only the top-20 results significantly reduces cost
• E.g., cost with sampling rate 0.25 is only twice that of noprobe
Conclusion
• Feature-based spam detection techniques successfully decrease the amount of spam
  – By 9% in top-200 results; by 92% in top-20 results
• Cost control methods are effective in reducing network cost
  – The factor increase in cost over noprobe drops from 7 to 2
  – At the same time, performance is at least 22% better than noprobe for top-20 results
References
• LimeWire junk filter. http://wiki.limewire.org/index.php?title=Junk_Filter
• J. Liang, R. Kumar, Y. Xi, and K. Ross. Pollution in P2P File Sharing Systems. In Proc. INFOCOM’05, May 2005.
• K. Svore, Q. Wu, C. J. C. Burges, and A. Raman. Improving Web Spam Classification Using Rank-time Features. In Proc. AIRWeb Workshop at WWW, 2007.
• S. Hershkop and S. J. Stolfo. Combining Email Models for False Positive Reduction. In Proc. KDD’05, Chicago, Aug. 2005.
• P. A. Chirita, J. Diederich, and W. Nejdl. MailRank: Using Ranking for Spam Detection. In Proc. CIKM’05, Bremen, Germany, 2005.
• A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting Spam Web Pages Through Content Analysis. In Proc. WWW’06.
• S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proc. WWW’03.
• Z. Gyöngyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. In Proc. VLDB’06, 439–450.
• LimeWire. www.limewire.org
• R. Zhou and K. Hwang. Gossip-based Reputation Aggregation for Unstructured Peer-to-Peer Networks. In Proc. IPDPS’07, Los Angeles, March 2007.
• K. Walsh and E. G. Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. In Proc. NSDI’06.
• U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla. Understanding Pollution Dynamics in P2P File Sharing. In Proc. IPTPS’06.
• Questions?
• Contact info
  – WWW: www.ir.iit.edu
  – Email: [email protected]
Thanks from IIT’s IR Lab!
Related Work
• Email spam detection
  – Hershkop et al., KDD’05
    • Analyze email content and syntax
  – Chirita et al., CIKM’05
    • Construct social networks from email addresses
• Web spam detection
  – Ntoulas et al., WWW’06
    • Analyze the content of Web pages
  – Gyöngyi et al., VLDB’06
    • Analyze the link structure of Web pages
Related Work (Cont’d)
• P2P spam detection
  – Spam filter in LimeWire
    • User-controlled spam learning
  – Liang et al., INFOCOM’05
    • Detect spam using extra info, e.g., the official CD length of a media file
  – Kamvar et al., WWW’03
    • Build reputation systems to rank peers
Simulating P2P Search
• Built a system to simulate P2P search on the client side
• Simulating query routing
  – A query is randomly sent to 50 peers
  – Repeat until either stop condition is satisfied
    • Condition 1: num of unique results reaches 200
    • Condition 2: num of peers that have received the query reaches 50K
  – Threshold values chosen based on the specifications of real-world P2P systems (e.g., LimeWire’s Gnutella)
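The routing loop above can be sketched as follows. This is a minimal client-side simulation; parameter names and the in-memory peer model (peer → set of matching results) are illustrative assumptions:

```python
import random

def simulate_search(peers, batch=50, max_results=200, max_peers=50_000, seed=0):
    """Repeatedly send the query to batches of random peers until either
    stop condition (enough unique results, or enough peers contacted) holds."""
    rng = random.Random(seed)
    remaining = list(peers)
    rng.shuffle(remaining)                      # random peer order
    results, contacted = set(), 0
    while remaining and len(results) < max_results and contacted < max_peers:
        for peer in remaining[:batch]:          # query the next 50 random peers
            results |= peers[peer]
            contacted += 1
        remaining = remaining[batch:]
    return results, contacted
```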
Experimental Results
[Chart: Avg Num Spam vs. Top N Results for noprobe+numRep, noprobe+CosineQD, probe+numRep, probe+Cosine, probe+Jaccard, and probe+numUniqueTerms]
• Compared with noprobe+numRep, probe+Cosine improves performance by 9% for top-200 results and by 92.5% for top-20 results
• Compared with noprobe+CosineQD, the improvements are 21.6% and 97.8%, respectively
Experimental Results (Cont’d)
• Compare Cosine/Jaccard distance with numUniqueTerms fairly by considering only multi-replica files
[Chart: Avg Num Spam vs. Top N Results for probe+Cosine, probe+Jaccard, and probe+numUniqueTerms]