LSDS-IR’08 www.ir.iit.edu
Cost-Effective Spam Detection in P2P File-Sharing Systems
Dongmei Jia
Information Retrieval Lab, Illinois Institute of Technology
[email protected]
Dec 18, 2015
Goal
• Create cost-effective ways of automatically detecting spam in P2P search results without downloading the actual files
Introduction
• Spam
  – Any file that is misrepresented deliberately or in a way that manipulates established retrieval and ranking techniques
• Spam is harmful
  – Degrades the user search experience
  – Assists the propagation of viruses in the network
  – Has a significant impact on P2P traffic load
Problem Statement
• Naïve spam detection method
  – Download files and check them manually
  – Cons:
    • Time- and labor-consuming
    • Wastes bandwidth and storage resources
    • Risks opening malware
• Hence, automatic spam detection is needed!
eMule Example
[Screenshot of an eMule result list, annotated with the query (number of results), descriptors, group size, and file key]
Hard to detect spam automatically in a query result set!
Types of Spam
• Type 1: Files whose replicas have semantically different descriptors
  – E.g., different song titles for the same key
26NZUBS655CC66COLKMWHUVJGUXRPVUF:
“12 days after christmas.mp3”
“i want you thalia.mp3”
“comon be my girl.mp3”
…
Types of Spam (Cont’d)
• Type 2: Files with long descriptors that contain semantically nonsensical term combinations
  – Single-descriptor problem
  – E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4: “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”
Types of Spam (Cont’d)
• Type 3: Files with descriptors that contain no query terms
  – Ads or warnings about the illegal distribution of copyrighted materials
  – E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
Types of Spam (Cont’d)
• Type 4: Files that are highly replicated on a single peer
  – Normal users do not create multiple replicas of the same file on a single server
  – Manipulates the “group size” ranking
  – E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer
Feature-Based Spam Detection
• Basic idea
  – Detect spam results using P2P features that are strongly correlated with spam
    • Vocabulary size of a file’s group descriptor
    • Variance of terms between a replica descriptor D of a file and its group descriptor G
      – Jaccard distance: 1 − |D ∩ G| / |D ∪ G|
      – Cosine distance: 1 − (VG·VD) / (|VG| |VD|)
    • Per-host replication degree of a file
      – numRep / numHost
    • …
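The two descriptor-distance features above can be sketched as follows. This is a minimal bag-of-words illustration; the function names are illustrative, not from the paper:

```python
from collections import Counter
from math import sqrt

def jaccard_distance(desc, group):
    """1 - |D ∩ G| / |D ∪ G| over the descriptors' term sets."""
    d, g = set(desc.lower().split()), set(group.lower().split())
    return 1.0 - len(d & g) / len(d | g)

def cosine_distance(desc, group):
    """1 - (VG·VD) / (|VG| |VD|) over term-frequency vectors."""
    vd, vg = Counter(desc.lower().split()), Counter(group.lower().split())
    dot = sum(vd[t] * vg[t] for t in vd)
    norm = sqrt(sum(c * c for c in vd.values())) * sqrt(sum(c * c for c in vg.values()))
    return 1.0 - dot / norm

# A replica whose descriptor diverges from the pooled group descriptor gets a
# larger distance, which correlates with spam (e.g., Type 1 above).
```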
Probe Query
• Problem
  – Results have insufficient and biased description info
    • Conjunctive query matching
• Solution
  – Gather more info for a result from the network
    • Other replica descriptors of the file
    • Statistics of the peers who share the file
      – Num of files, num of unique files, peer ID
  – Implementation
    • A probe query contains only a file key, not a “term” query
  – Intuition
    • Probing helps create a more complete view of a file
    • Ranking is more effective with more adequate file info
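As a rough illustration of what a key-only probe gathers, the sketch below collects replica descriptors and peer statistics for one file key. All names and the in-memory peer model are assumptions for illustration, not the paper's protocol:

```python
def probe(file_key, peers):
    """Collect all replica descriptors and peer statistics for one file key."""
    descriptors, peer_stats = [], []
    for peer in peers:
        # Replicas of the probed file shared by this peer
        matches = [f["descriptor"] for f in peer["files"] if f["key"] == file_key]
        if matches:
            descriptors.extend(matches)
            # Peer-level statistics used as detection features
            peer_stats.append({
                "peer_id": peer["id"],
                "num_files": len(peer["files"]),
                "num_unique": len({f["key"] for f in peer["files"]}),
            })
    return descriptors, peer_stats
```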
Evaluation
• Dataset
  – P2P audio files crawled from the Gnutella network:
    • numRep = 25,137,217; numFile = 9,575,113; numPeer = 226,786
  – 50 most popular queries in the crawled dataset
    • Representative of most users, and a more likely target for spam
• Metric
  – Num of spam results in the top-N ranked results, esp. for a small N
• Effectiveness
  – Improves performance by 9% for top-200 results and by 92.5% for top-20 results
    • Base case: noprobe+numRep
Cost Control
• Tradeoff
  – Performance vs. cost
• Cost
  – Num of responses for the regular query and probe queries
• Problem
  – Network cost is dramatically increased by probing
• How to reduce the cost?
Cost Control Approaches
• Random sampling of probe query results
• Piggy-backing of descriptor data in probe queries
• Limiting the scope of probing
Random Sampling
• Server-side random sampling of probe query results
  – A predefined probability P, 0 ≤ P ≤ 1
  – Reduces cost predictably, by a factor of P
  – Impact on the effectiveness of spam detection?
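A minimal sketch of the server-side sampling step, assuming each server flips an independent coin with probability P per probe response (the function name is illustrative):

```python
import random

def sample_probe_responses(responses, p, rng=random):
    """Return each probe response independently with probability p,
    so the expected number of responses (and cost) shrinks by a factor p."""
    return [r for r in responses if rng.random() < p]
```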
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Cost is reduced significantly by sampling fewer probe results
• In all sampling cases, overall performance is still 1.7%–9% better than noprobe
• But the cost is still high: with 25% sampling, cost is ~7 times higher than noprobe
• Performance for top-20 results is 71%–92% better than noprobe
Piggy-backing of Descriptor Data
• Piggy-backing of descriptor data in probe queries
  – New type of probe query
    • File key + descriptor of the result file being probed
  – A server does not respond if its descriptor contains no new terms compared with the descriptor in the probe query
    • Limits the num of probe results returned to the client
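The server-side filter can be sketched as follows; a minimal illustration in which the function name and the bag-of-words comparison are my assumptions, not the paper's exact protocol:

```python
def should_respond(server_descriptor, probed_descriptor):
    """A server answers a piggy-backed probe only when its local descriptor
    adds at least one term the client has not already seen."""
    server_terms = set(server_descriptor.lower().split())
    known_terms = set(probed_descriptor.lower().split())
    return bool(server_terms - known_terms)
```

Suppressing redundant responses this way trades a little descriptor coverage for a large cut in probe traffic.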
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Compared with the original type of probe, total cost is decreased by 35%–39% for all sampling rates
• Compared with the original type of probe, overall performance drops by ~15%
• E.g., the cost with sampling rate 0.25 is ~4 times higher than noprobe
• However, performance for top-20 results is improved by 71%–88% in all sampling cases
Limiting Probing Scope
• Limiting the scope of probing
  – Only probe a few top-ranked (e.g., top-20) regular query results
  – Intuition
    • Users tend to consider downloading a file only from a few top-ranked results
Experimental Results
[Charts: Avg Num Spam vs. Top N Results, and Avg Total Cost vs. Probe Query Sampling Rate, for sampling rates 0.25, 0.5, 0.75, 1, and noprobe]
• Performance of probing only the top-20 results is always 22%–56% better than noprobe
• Probing only the top-20 results significantly reduces cost
• E.g., cost with sampling rate 0.25 is only twice that of noprobe
Conclusion
• Feature-based spam detection techniques successfully decrease the amount of spam
  – By 9% in top-200 results; by 92% in top-20 results
• Cost control methods are effective in reducing network cost
  – The factor increase in cost over noprobe drops from 7 to 2
  – At the same time, performance is at least 22% better than noprobe for top-20 results
References
• LimeWire junk filter. http://wiki.limewire.org/index.php?title=Junk_Filter
• J. Liang, R. Kumar, Y. Xi, and K. Ross. Pollution in P2P File Sharing Systems. In Proc. INFOCOM’05, May 2005.
• K. Svore, Q. Wu, C. J. C. Burges, and A. Raman. Improving Web Spam Classification Using Rank-time Features. In Proc. AIRWeb Workshop at WWW, 2007.
• S. Hershkop and S. J. Stolfo. Combining Email Models for False Positive Reduction. In Proc. KDD’05, Chicago, Aug. 2005.
• P. A. Chirita, J. Diederich, and W. Nejdl. MailRank: Using Ranking for Spam Detection. In Proc. CIKM’05, Bremen, Germany, 2005.
• A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting Spam Web Pages Through Content Analysis. In Proc. WWW’06.
• S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proc. WWW’03.
• Z. Gyöngyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. In Proc. VLDB’06, 439–450.
• LimeWire. www.limewire.org
• R. Zhou and K. Hwang. Gossip-based Reputation Aggregation for Unstructured Peer-to-Peer Networks. In Proc. IPDPS’07, Los Angeles, March 2007.
• K. Walsh and E. G. Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. In Proc. NSDI’06.
• U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla. Understanding Pollution Dynamics in P2P File Sharing. In Proc. IPTPS’06.
• Questions?
• Contact info
  – WWW: www.ir.iit.edu
  – Email: [email protected]
Thanks from IIT’s IR Lab!
Related Work
• Email spam detection
  – Hershkop et al., KDD’05
    • Analyze email content and syntax
  – Chirita et al., CIKM’05
    • Construct social networks from email addresses
• Web spam detection
  – Ntoulas et al., WWW’06
    • Analyze the content of Web pages
  – Gyöngyi et al., VLDB’06
    • Analyze the link structure of Web pages
Related Work (Cont’d)
• P2P spam detection
  – Spam filter in LimeWire
    • User-controlled spam learning
  – Liang et al., INFOCOM’05
    • Detect spam using extra info, e.g., the official CD length of a media file
  – Kamvar et al., WWW’03
    • Build reputation systems to rank peers
Simulating P2P Search
• Built a system to simulate P2P search on the client side
• Simulating query routing
  – A query is randomly sent to 50 peers
  – Repeat until either stop condition is satisfied
    • Condition 1: num of unique results reaches 200
    • Condition 2: num of peers that have received the query reaches 50K
  – Threshold values chosen based on the specifications of real-world P2P systems (e.g., LimeWire’s Gnutella)
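The routing loop above can be sketched as follows. This is a minimal client-side simulation; parameter names and the in-memory peer model (peer → set of matching results) are illustrative assumptions:

```python
import random

def simulate_search(peers, batch=50, max_results=200, max_peers=50_000, seed=0):
    """Repeatedly send the query to batches of random peers until either
    stop condition (enough unique results, or enough peers contacted) holds."""
    rng = random.Random(seed)
    remaining = list(peers)
    rng.shuffle(remaining)                      # random peer order
    results, contacted = set(), 0
    while remaining and len(results) < max_results and contacted < max_peers:
        for peer in remaining[:batch]:          # query the next 50 random peers
            results |= peers[peer]
            contacted += 1
        remaining = remaining[batch:]
    return results, contacted
```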
Experimental Results
[Chart: Avg Num Spam vs. Top N Results for noprobe+numRep, noprobe+CosineQD, probe+numRep, probe+Cosine, probe+Jaccard, and probe+numUniqueTerms]
• Compared with noprobe+numRep, probe+Cosine improves performance by 9% for top-200 results and by 92.5% for top-20 results
• Compared with noprobe+CosineQD, the improvements are 21.6% and 97.8%, respectively
Experimental Results (Cont’d)
• Compare Cosine/Jaccard distance with numUniqueTerms fairly by considering only multi-replica files
[Chart: Avg Num Spam vs. Top N Results for probe+Cosine, probe+Jaccard, and probe+numUniqueTerms]