PPM based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
College of Computer Science, Zhejiang University
April 10, 2008

Page 1: PPM based Spam Filtering in SEWM2008

PPM based Spam Filtering in SEWM2008

Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong

College of Computer Science, Zhejiang University

April 10, 2008

Page 2: PPM based Spam Filtering in SEWM2008

Outline

PPM (prediction by partial matching)
Email Pre-processing
Train PPM Model
Model Classification

Page 3: PPM based Spam Filtering in SEWM2008

PPM

Data Compression

Page 4: PPM based Spam Filtering in SEWM2008

PPM Framework

Page 5: PPM based Spam Filtering in SEWM2008

Email Pre-processing

Source alphabet
Merge consecutive spaces
Truncate long messages

Page 6: PPM based Spam Filtering in SEWM2008

Email Pre-processing

Raw data: Abcd_= - Af?/[]=+ safj =ab fe addfe

Sample:
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20

After replace: abcd_= ? Af????=? ?af? =ab fe addfe

After merging blanks: abcd_= ? Af????=? ?af? =ab fe addfe

After truncate: abcd_= ? Af????=? ?a
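
The three pre-processing steps can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `preprocess` and the raw string (written with repeated spaces so the merge step has a visible effect) are assumptions made for the example. The sketch does not fold case, since the slides do not say whether case folding is applied, and it merges only runs of the space character.

```python
import re

def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper: replace, merge blanks, truncate."""
    # 1. Replace every character outside the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # 2. Merge each run of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # 3. Truncate long messages to max_len characters.
    return merged[:max_len]

# Alphabet from the sample slide: a-f, underscore, equals sign, space.
ALPHABET = set("abcdef_= ")

# Made-up raw input with repeated spaces (an assumption for illustration).
raw = "abcd_=  -   af?/[]=+  safj =ab  fe addfe"
print(preprocess(raw, ALPHABET))
```

Replacing before merging means that characters mapped to the replace char never create new runs of spaces, so the order of the first two steps does not change the result here.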

Page 7: PPM based Spam Filtering in SEWM2008

Train PPM Model

Use an order-6 PPM* model
Use Method D escape estimation
Train two PPM models:
HAM model
SPAM model
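
For readers unfamiliar with Method D, the sketch below shows how it splits probability mass inside a single context. With n total observations and d distinct symbols in that context, a symbol seen c times gets (2c - 1) / (2n) and the escape event gets d / (2n). The function name `method_d` and the example counts are assumptions made for illustration; the slides give no code.

```python
from fractions import Fraction

def method_d(counts):
    """Method D estimates for one context.

    counts maps each symbol seen in the context to its frequency.
    Returns (symbol probabilities, escape probability).
    """
    n = sum(counts.values())   # total observations in this context
    d = len(counts)            # number of distinct symbols seen
    probs = {s: Fraction(2 * c - 1, 2 * n) for s, c in counts.items()}
    escape = Fraction(d, 2 * n)
    return probs, escape

# Example context in which 'a' was seen 3 times and 'b' once.
probs, escape = method_d({"a": 3, "b": 1})
print(probs, escape)
```

The symbol probabilities and the escape probability always sum to 1, because each of the d seen symbols gives up 1 / (2n) of its mass to the escape event.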

Page 8: PPM based Spam Filtering in SEWM2008

Model Classification

MCE (Minimum Cross-Entropy)
MDL (Minimum Description Length)
Spam Score
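
A minimal sketch of minimum cross-entropy classification follows. It makes several simplifying assumptions that are not in the slides: a fixed order-2 model instead of order-6 PPM*, no symbol exclusion, a uniform order -1 fallback over 256 symbols, and a spam score defined as the difference between the two cross-entropies. The class `SimplePPM`, the function `classify`, and the toy training strings are all hypothetical. Only the static MCE variant is shown; the MDL variant, which also updates the model while scoring the message, is not covered.

```python
import math
from collections import defaultdict

class SimplePPM:
    """Simplified order-k PPM with Method D escapes (illustration only)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        p = 1.0
        # Try the longest context first, escaping to shorter ones.
        for k in range(min(self.order, len(history)), -1, -1):
            seen = self.counts.get(history[len(history) - k:])
            if not seen:
                continue  # context never seen: fall back at no cost
            n = sum(seen.values())
            if sym in seen:
                return p * (2 * seen[sym] - 1) / (2 * n)
            p *= len(seen) / (2 * n)  # Method D escape probability
        return p / self.alphabet_size  # order -1: uniform distribution

    def cross_entropy(self, text):
        """Average bits per character needed to code text."""
        bits = 0.0
        for i, sym in enumerate(text):
            bits -= math.log2(self.prob(text[max(0, i - self.order):i], sym))
        return bits / max(len(text), 1)

def classify(message, ham_model, spam_model):
    h_ham = ham_model.cross_entropy(message)
    h_spam = spam_model.cross_entropy(message)
    # Assumed score: positive means the spam model fits better.
    score = h_ham - h_spam
    return ("spam" if score > 0 else "ham"), score

# Toy training data (made up for the example).
ham, spam = SimplePPM(), SimplePPM()
ham.train("meeting at noon tomorrow, see you at the meeting room " * 3)
spam.train("buy cheap pills now, cheap pills online buy now " * 3)

print(classify("cheap pills now", ham, spam))
print(classify("see you at noon", ham, spam))
```

The message is assigned to whichever model codes it in fewer bits per character, which is the same decision as picking the model that would compress it best.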

Page 9: PPM based Spam Filtering in SEWM2008

Advantage

Simple pre-processing
No decoding needed (sidesteps obfuscation)
Highly self-adaptive
Low false positive rate

Page 10: PPM based Spam Filtering in SEWM2008

References

"Spam Filtering Using Statistical Data Compression Models"

"Unbounded Length Contexts for PPM"

Page 11: PPM based Spam Filtering in SEWM2008

Question

Delay Index ham, Ham and HAM Active learning 10000

Deliver the filter

Page 12: PPM based Spam Filtering in SEWM2008

Thanks for your attention!

Q&A