Improving Image Spam Filtering Using Image Text Features

CEAS 2008

Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli

Pattern Recognition and Applications Group, Department of Electrical and Electronic Engineering, University of Cagliari, Italy


5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd


About me

• Pattern Recognition and Applications Group
– http://prag.diee.unica.it
– DIEE, University of Cagliari, Italy

• Contact
– Battista Biggio, Ph.D. student
– [email protected]


Pattern Recognition and Applications Group

• Research interests
– Methodological issues
  • Multiple classifier systems
  • Adversarial learning
  • Classification reliability
– Main applications
  • Intrusion detection in computer networks
  • Multimedia document categorization, spam filtering
  • Biometric authentication (fingerprint, face)
  • Content-based image retrieval

• Faculty members
– F. Roli (group head)
– G. Giacinto
– G. Fumera
– L. Didaci
– G.L. Marcialis

– 7 PhD students
– 3 postdocs
– 2 consultants


Outline

• Introduction
– What is image spam?

• Image spam filtering
– Image spam SoA
– Our work

• Experiments

• A plug-in for SpamAssassin: Image Cerberus


Image spam

• Since about 2005: image spam
– Embedding spam messages into images to evade modules based on machine learning approaches (e.g. Bayesian filters)
– Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)


Image spam SoA

• Commercial / open source anti-spam filters:
– OCR + keyword search
– Image low-level feature analysis

• Research:
– OCR + text categorization (TC)
  • Fumera et al., JMLR 2006
– BayesOCR plug-in for SpamAssassin
– Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
  • Wu et al., ICIP 2005
  • Aradhye et al., ICDAR 2005
  • Dredze et al., CEAS 2007


Our past work

• OCR is not effective against obfuscated images
– Spammers learned from CAPTCHAs / HIPs!

• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
– How did we detect the presence of adv. obfuscated text?
  • Four features based on:
    – Text localisation
    – Perimetric complexity
    – Edge detection

• However, these features did not work as we thought for detecting only adversarial obfuscated text…
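As an illustration of one ingredient named above, perimetric complexity is commonly defined as the squared perimeter of a binary shape divided by its area. The sketch below assumes that common definition and a 4-connected pixel perimeter; the slides do not give the exact formula or the four features actually used, and the function name `perimetric_complexity` is hypothetical.

```python
import numpy as np


def perimetric_complexity(mask: np.ndarray) -> float:
    """Squared perimeter divided by foreground area of a binary mask.

    Assumption: perimeter = number of foreground/background pixel edges
    (4-connectivity). This is an illustrative definition, not necessarily
    the one used in Biggio et al., CEAS 2007.
    """
    fg = mask.astype(bool)
    area = int(fg.sum())
    if area == 0:
        return 0.0
    # Pad with background so shapes touching the image border are closed.
    padded = np.pad(fg, 1, constant_values=False)
    # Count transitions between horizontally and vertically adjacent pixels.
    horizontal = np.count_nonzero(padded[:, 1:] != padded[:, :-1])
    vertical = np.count_nonzero(padded[1:, :] != padded[:-1, :])
    perimeter = horizontal + vertical
    return perimeter ** 2 / area
```

Under this definition a solid 4x4 square has perimeter 16 and area 16, giving a value of 16, while broken or noisy characters with the same area yield larger values.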


This work

• Our image text defect measures seemed to be able to provide some discriminant information about low-level text characteristics between ham and spam images

• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability


Experiments

• Data sets (1)
– A: 2006 ham images, 3297 spam images
– B: 2006 ham images, 8549 spam images

• Image feature sets
– Aradhye et al., ICDAR 2005
  • Color heterogeneity, color saturation, text area
– Dredze et al., CEAS 2007
  • Image meta-data, visual features
– Four other visual features, for comparison (generic)
  • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
– Features used in this work (text)

(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
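Three of the four generic features listed above can be computed directly from pixel values. The following is a minimal sketch, assuming an RGB image stored as a NumPy array of shape (height, width, 3) and natural logarithms; the function name `generic_features` and the dictionary keys are hypothetical, and the text area feature is omitted because it requires a text localisation step not described in the slides.

```python
import numpy as np


def generic_features(rgb: np.ndarray) -> dict:
    """Compute three simple visual features from an (H, W, 3) RGB array."""
    height, width, _ = rgb.shape
    num_pixels = height * width
    # Treat each pixel as one (R, G, B) row and count the distinct colors.
    pixels = rgb.reshape(-1, 3)
    colors, counts = np.unique(pixels, axis=0, return_counts=True)
    return {
        # Log base is an assumption; the slides only say "(log)".
        "log_num_colors": float(np.log(len(colors))),
        "log_num_pixels": float(np.log(num_pixels)),
        # Fraction of the image covered by the most frequent color.
        "most_common_color_area": float(counts.max() / num_pixels),
    }
```

For example, a 4x5 white image with one black row has 2 colors, 20 pixels, and a most common color covering 0.75 of the image.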


Experiments (cont’d)

• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and their fusion (either at feature or score level) with our features (text).

[Figure: feature-level fusion uses a single classifier C(x1∪x2); score-level fusion combines the outputs of C(x1) and C(x2) through a classifier C(s)]
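The two fusion schemes can be sketched as follows. This is a minimal illustration, not the classifiers used in the experiments: `LinearScorer` is a hypothetical least-squares stand-in for any classifier that outputs a real-valued score, `X1` and `X2` are assumed to be feature matrices for the same images (e.g. an existing feature set and the text features), and `y` holds 0/1 labels (ham/spam).

```python
import numpy as np


class LinearScorer:
    """Least-squares linear classifier; a stand-in for any scoring classifier."""

    def fit(self, X, y):
        A = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
        # Regress onto targets -1 (ham) and +1 (spam).
        self.w, *_ = np.linalg.lstsq(A, 2.0 * np.asarray(y) - 1.0, rcond=None)
        return self

    def score(self, X):
        return np.hstack([X, np.ones((len(X), 1))]) @ self.w


def feature_level_fusion(X1, X2, y):
    """Train one classifier C(x1 ∪ x2) on the concatenated feature vectors."""
    return LinearScorer().fit(np.hstack([X1, X2]), y)


def score_level_fusion(X1, X2, y):
    """Train C(x1) and C(x2) separately, then a combiner C(s) on their scores."""
    c1 = LinearScorer().fit(X1, y)
    c2 = LinearScorer().fit(X2, y)
    S = np.column_stack([c1.score(X1), c2.score(X2)])
    cs = LinearScorer().fit(S, y)
    return c1, c2, cs
```

To classify a new image with the score-level scheme, compute `c1.score` and `c2.score` on its two feature vectors, stack them, and threshold `cs.score` at zero. For brevity the combiner here is trained on the same samples as the base classifiers; in practice held-out scores are usually preferable to limit overfitting.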


Results


A plug-in for SpamAssassin: Image Cerberus

• We implemented a SpamAssassin plug-in based on our approach
– generic + text fused at feature level

• Publicly available
– http://prag.diee.unica.it/n3ws1t0/imageCerberus

• We will release source code (C++) soon

We need your feedback!



Some examples

score = 1.06 score = 0.98 score = 0.28


Some examples (cont’d)

score = 0.82 score = 1.00 score = 0.63


Spam or ham?

• Ham images from the TREC 2007 spam corpus!

score = 0.20 score = -1.4 score = 0.27


Thank you!

• See you at the poster session!

• Contacts
– [email protected]
– [email protected]

• Web
– http://prag.diee.unica.it
