CEAS 2008
Improving Image Spam Filtering Using Image Text Features
Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli
Pattern Recognition and Applications Group
Department of Electrical and Electronic Engineering
University of Cagliari, Italy
5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd
21-08-2008 Image Spam Filtering CEAS 2008
About me
• Pattern Recognition and Applications Group, http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy.
• Fumera et al., JMLR 2006
  – BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007
Our past work
• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!
• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adv. obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection
• However, these features did not work as we thought for detecting only adversarial obfuscated text…
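As an illustration of the "perimetric complexity" item above, here is a minimal Python sketch. It assumes a commonly used definition, squared perimeter divided by ink area, and a binarised image given as rows of 0/1 values. The slides do not state the exact formula or the preprocessing used, so both are assumptions, and the function name is hypothetical.

```python
def perimetric_complexity(binary):
    """Squared perimeter divided by ink area for a 0/1 image (list of rows).

    Assumed definition; the slides do not give the formula.
    """
    rows, cols = len(binary), len(binary[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            area += 1
            # Each pixel side that faces background (or the image border)
            # contributes one unit of perimeter.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if not inside or not binary[nr][nc]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / area


square = [[1] * 4 for _ in range(4)]   # compact blob
stroke = [[1] * 8]                     # thin one-pixel line

print(perimetric_complexity(square))   # 16.0
print(perimetric_complexity(stroke))   # 40.5
```

Thin or broken strokes give higher values than compact blobs, which is why a measure of this kind can react to degraded or obfuscated characters.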
This work
• Our image text defect measures seemed to be able to provide some discriminant information about low level text characteristics between ham and spam images
• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability
Experiments
• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images
• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)
(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
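Three of the four "generic" features listed above are simple to compute, and the following Python sketch shows one possible way. It assumes the image is given as rows of (R, G, B) tuples and that "log" means the natural logarithm; the slides specify neither. The fourth feature, text area, needs a text localisation step and is left out. The function and key names are hypothetical.

```python
import math
from collections import Counter


def generic_features(pixels):
    """Compute three generic features from rows of (R, G, B) tuples.

    Assumptions: natural logarithm, exact colour matching (no quantisation).
    """
    flat = [p for row in pixels for p in row]
    counts = Counter(flat)
    n_pixels = len(flat)
    return {
        "log_num_colors": math.log(len(counts)),
        "log_num_pixels": math.log(n_pixels),
        # Fraction of the image covered by the most frequent colour.
        "top_color_area": counts.most_common(1)[0][1] / n_pixels,
    }


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
tiny = [[WHITE, WHITE], [WHITE, BLACK]]
print(generic_features(tiny)["top_color_area"])  # 0.75
```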
Experiments (cont’d)
• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and on their fusion (either at feature or score level) with our features (text).
[Figure: two fusion schemes. Feature level fusion: classifier C(x1∪x2). Score level fusion: classifiers C(x1) and C(x2), followed by C(s).]
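The difference between the two fusion schemes can be sketched in Python as follows. This is a toy illustration only: the nearest-centroid classifier stands in for whatever classifier the experiments used (the slides do not name it here), `x1` and `x2` denote two feature sets as in the figure labels, `s` is assumed to be the vector of the two classifier scores, and all data values are made up.

```python
import math


class CentroidClassifier:
    """Toy stand-in classifier: score > 0 means closer to the spam centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):  # 0 = ham, 1 = spam
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x):
        return math.dist(x, self.centroids[0]) - math.dist(x, self.centroids[1])


def feature_level_fusion(X1, X2, y):
    # One classifier C(x1 ∪ x2) trained on the concatenated feature vectors.
    clf = CentroidClassifier().fit([a + b for a, b in zip(X1, X2)], y)
    return lambda x1, x2: clf.score(x1 + x2)


def score_level_fusion(X1, X2, y):
    # Separate classifiers C(x1) and C(x2); a third one, C(s), is trained on
    # their scores. For brevity the scores come from the training data itself;
    # a real setup would use held-out scores.
    c1 = CentroidClassifier().fit(X1, y)
    c2 = CentroidClassifier().fit(X2, y)
    S = [[c1.score(a), c2.score(b)] for a, b in zip(X1, X2)]
    cs = CentroidClassifier().fit(S, y)
    return lambda x1, x2: cs.score([c1.score(x1), c2.score(x2)])


# Made-up data: two ham and two spam images, two feature sets each.
X1 = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
X2 = [[0.0], [0.1], [1.0], [0.9]]
y = [0, 0, 1, 1]

by_feature = feature_level_fusion(X1, X2, y)
by_score = score_level_fusion(X1, X2, y)
print(by_feature([0.85, 0.85], [0.95]) > 0)  # True: classified as spam
print(by_score([0.15, 0.15], [0.05]) < 0)    # True: classified as ham
```

Feature level fusion lets one classifier see interactions between the two feature sets, while score level fusion keeps the existing classifiers unchanged and only adds a small combiner on top.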
Results
A plug-in for SpamAssassin: Image Cerberus
• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level