Phishing Detection Using Probabilistic Latent Semantic Analysis

Venkatesh Ramanathan & Dr. Harry Wechsler
Department of Computer Science, George Mason University, Fairfax, VA 22030

Quantitative Methods in Defense and National Security 2010
George Mason University
May 25-26, 2010

My name is Venkatesh Ramanathan. I'm a student at GMU, and my advisor is Prof. Harry Wechsler. I'm currently pursuing a Ph.D. in the Department of Computer Science, and my dissertation is in the area of phishing detection and prevention. My presentation today will focus on some initial work that I've done in this area.

OUTLINE

- Motivation and Goals
- Background
- Methodology
- Implementation
- Experimental Design and Results
- Conclusions and Future Research

My presentation is organized as follows.

Motivation and Goals

- In phishing, an attacker tricks users into divulging personal and financial data
- More than 50,000 unique attacks are detected per month by the Anti-Phishing Working Group

[Figure: fishing vs. phishing analogy. The attacker is the fisherman (in a boat), the email is the bait, and the victim is the fish.]

Phishing is a criminal activity that uses social engineering techniques. Just as a fisherman in a boat trolls the river to catch fish, the attacker trolls the Internet to trick users into revealing personal and financial data. The bait is usually an email message with convincing content. APWG, a non-profit organization, reports more than 50,000 unique phishing attacks per month; 75% of those attacks target financial institutions. The cost of phishing was reported to be 3.2 billion US dollars in 2007. In Q4 of 2009, there were 10 million infected computers. The cost of phishing will only rise as we rely more and more on the Internet to conduct transactions, and it is infeasible for humans to recognize these baits because they look authentic.

Goal and Method

- Goal:
  - Develop a self-managing computing system that detects and prevents phishing attacks, predicts future attacks, and automatically adapts to new attacks
- Presentation focus:
  - Development of a content filtering module to detect phishing attacks
- Module employs:
  - Probabilistic Latent Semantic Analysis for topic identification and classification

Background


Attacker’s Motivation & Mode of Operation

- Motivation
  - Exploit security holes and consumers' trust to gather personal information
  - Use stolen information for financial benefit, identity theft, and other fraud
- Mode of Operation
  - Find a domain name that resembles the current target
  - Find similarities that would make the phisher's site look legitimate
  - Register the domain
  - Find a reliable anonymous host (offshore)
  - Design a web page similar to the source web site, with added code to collect and post the user's credentials to the attacker's server

Phishing Examples: Collect User Credentials

[Figure: attack flow between the Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, and Real Web Site.]

1. Attacker composes a "phish" email and sends it.
2. Mail server processes the email and stores it in the victim's account.
3. Victim's PC fetches the "phish" email.
4. Victim opens the "phish" email.
5. Victim clicks on a link.
6. The link launches a browser and takes the victim to a phishing web site.
7. Victim, thinking it's a real web site, enters username/password.
8. Attacker collects the credentials, puts up an invalid-login message, redirects to the real web site, and takes down the phishing site.
9. Victim enters credentials again and logs into the real site.

Phishing Examples: Email To Collect User Credentials

Received: from 101.212.50.151 by ; Tue, 15 Nov 2005 16:27:51 -0700
From: "Credit Union" [email protected]        <-- PRETENDS TO BE FROM REAL ACCOUNT
Reply-To: "Credit Union" [email protected]    <-- PHISHING EMAIL ADDRESS/DOMAIN (TWO DIFFERENT ADDRESSES)
To: [email protected], [email protected], [email protected]
Subject: FCU: Account update
Date: Tue, 15 Nov 2005 19:28:51 -0400
Content-Type: text/html;                      <-- HTML EMAIL (EASIER TO DISGUISE/HIDE LINKS)
<html>
<head></head>
<body>
[CONVINCING CONTENT]
Credit Union is constantly working to ensure security by regularly screening the accounts in our system. We recently reviewed your account, and we need more information to help us provide you with secure service.
[RISK, IF NO ACTION]
Until we can collect this information, your access to sensitive account features will be limited. We would like to restore your access as soon as possible, and we apologize for the inconvenience. <br>
...
How can I restore my account access? Please confirm your identity here:
[PHISHING WEBSITE LINK]
Restore<a href=3D"http://200.73.81.212/.CREDIT-UNION/update.php">My Online Banking</a> and complete the "Steps to Remove Limitations." Completing all of the checklist items will automatically restore your account access.
</html>

Protection Techniques

[Figure: the same mail path (Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, Real Web Site), annotated with where each protection technique applies.]

1. Network Level
2. Authentication
3. Filters/Classifiers
4. User Profile Filters, Toolbars
5. Mimic Prevention
6. User Education

Methodology


- Self-managing computing system that adapts to unpredictable changes whilst hiding intrinsic complexity from operators and users
  - Self-Protection: System detects malicious activities and protects from such acts without human involvement
  - Self-Optimization: System optimizes itself due to changes in load and environment
  - Self-Configuration: System configures itself upon initialization, restart, and changes in environment and load
  - Self-Healing: System recovers automatically from hardware and software failures

[Figure: Autonomic Computing system surrounded by its four properties: Self-Protection, Self-Optimization, Self-Configuration, Self-Healing.]

System Architecture

[Figure: system architecture block diagram. Labeled modules: Phishing Detection Module, Adaptive Module, Predictive Module, Optimization Module, Repair Module, Round Robin Module. Labeled components: Content Parser, Feature Extractor, Link Analyzer, Content Classifier (PLSA), Fold-In Technique, Watch List, Watch List Monitor, Watch List Repository, Repository Updater, Time Slice Control, Web Crawler, Content Cache, Event Manager, Event Collector, Event Analyzer, Event Publisher, Prediction Modeler, External Data Sources, System Configuration, Self-Protection Engine, Self-Optimization Engine, Self-Configuration Engine, Self-Healing Engine, New Attack Monitor, Workload Monitor, Performance Monitor, Software Updater.]


- To detect phishing attacks, the module will:
  - employ a filtering model that separates good email from phishing email
  - store suspicious emails for other modules to use
  - employ PLSA, a natural language processing technique

[Figure: email is filtered into GOOD EMAIL, PHISHING EMAIL, and SUSPICIOUS; suspicious email goes to the WATCHLIST.]


- PLSA makes use of "context" and term co-occurrences.
- Handles topics containing polysemous words
  - words that have different meanings in different contexts
  - Bank: river bank vs. financial banking system
- Based on the likelihood principle and has a solid statistical foundation.
- Standard statistical techniques can be applied for model fitting.

PLSA Introduction

- PLSA:
  - Given: a set of documents, D = {d1, d2, …}
  - Find: latent (hidden) topics Z = {z1, z2, …}
  - From: vocabulary W = {w1, w2, …}
- Topic/concept probabilities are estimated based on all documents that deal with a concept.
- No prior knowledge about concepts is required.

[Figure: terms and documents are linked through topics (latent concepts). Example: the terms "Username", "loginname", "logonname", and "Account" map to the latent concept ACCOUNT.]

$$p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)$$

PLSA Introduction

[Figure: PLSA decomposition. Observed word distributions p(w|d) = word distributions per topic p(w|z) × topic distributions per document p(z|d).]

PLSA Example

PLSA Algorithm

- Step 1: Build the term-document matrix:
  - W = {w1, …, wj, f1, …, fj}; wi: words; fi: features
- Step 2: Initialize probabilities P(z), P(d|z), P(w|z) randomly.

[Figure: document-term matrix. Rows are the documents d1, …, di, …, dI in D; columns are the words and features in W.]
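Step 1 can be illustrated with a minimal sketch. It assumes each document has already been preprocessed into a list of tokens; the function name `build_term_document_matrix` and the toy documents are hypothetical and are not taken from the original implementation. Only word counts are shown, which matches the "words only" feature set used in the experiments.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocab, counts), where counts[i][j] is how often vocab[j] occurs in docs[i]."""
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        counts.append(row)
    return vocab, counts


# Toy documents (hypothetical), written as stemmed tokens.
docs = [
    ["account", "password", "confirm", "account"],
    ["movi", "dvd", "music"],
]
vocab, counts = build_term_document_matrix(docs)
```

Each row of `counts` corresponds to one document and each column to one vocabulary entry, which is the layout the EM steps operate on.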

PLSA : EM Algorithm

- Step 3: E-step: compute posterior probabilities for the latent variables
- Step 4: M-step: maximize the expected complete-data log-likelihood
- Step 5: Compute the log-likelihood
- Step 6: Iterate until the desired threshold is reached
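Steps 3 to 6 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the parameterization p(w|z), p(z|d) from the mixture formula, runs standard (untempered) EM, and stops when the log-likelihood gain falls below a hypothetical tolerance `tol`. The names `plsa_em` and `normalize` are made up for this example.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, n_topics, max_iter=200, tol=1e-6, seed=0):
    """Fit p(w|z) and p(z|d) to a document-term count matrix with EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Step 2: random initialisation; every row is a probability distribution.
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    history = []
    for _ in range(max_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        loglik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                # Step 5: log-likelihood under the current parameters.
                loglik += n * math.log(p_w_d)
                for z in range(n_topics):
                    # Step 3 (E-step): n(d,w) * p(z|d,w), accumulated for the M-step.
                    r = n * joint[z] / p_w_d
                    new_w_z[z][w] += r
                    new_z_d[d][z] += r
        # Step 4 (M-step): renormalise the accumulated expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(loglik)
        # Step 6: stop once the improvement is below the tolerance.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return p_w_z, p_z_d, history


# Toy count matrix (hypothetical): two documents per word group.
counts = [[2, 1, 0, 0], [3, 2, 0, 0], [0, 0, 2, 3], [0, 0, 1, 2]]
p_w_z, p_z_d, history = plsa_em(counts, n_topics=2)
```

The `history` list records the log-likelihood at each iteration, which should never decrease under EM and is a convenient sanity check on an implementation.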

PLSA – Folding-In Technique

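- Estimate probabilities for an "unseen" document (dnew).
- P(w|z) is unchanged; use the estimates from training.
- E-step: compute the posterior probabilities of the topics for the words in dnew
- M-step: update only P(z|dnew)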
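A minimal sketch of folding-in follows, under these assumptions: p(w|z) comes from training and is held fixed, p(z|dnew) starts uniform, and the decision threshold of 0.5 is a hypothetical value chosen only for illustration. The function name `fold_in` and the toy topic distributions are invented for this example.

```python
def fold_in(doc_counts, p_w_z, max_iter=100, tol=1e-9):
    """Estimate p(z|d_new) for one new document while p(w|z) stays fixed."""
    n_topics = len(p_w_z)
    p_z = [1.0 / n_topics] * n_topics  # uniform start (an assumption)
    for _ in range(max_iter):
        new = [0.0] * n_topics
        for w, n in enumerate(doc_counts):
            if n == 0:
                continue
            joint = [p_w_z[z][w] * p_z[z] for z in range(n_topics)]
            total = sum(joint)
            if total == 0:
                continue  # word has zero probability under every topic
            for z in range(n_topics):
                # E-step: posterior p(z|d_new, w), weighted by the word count.
                new[z] += n * joint[z] / total
        s = sum(new)
        if s == 0:
            break
        # M-step: only p(z|d_new) is re-estimated.
        updated = [v / s for v in new]
        converged = max(abs(a - b) for a, b in zip(updated, p_z)) < tol
        p_z = updated
        if converged:
            break
    return p_z


# Toy trained topics (hypothetical): topic 0 = phishing, topic 1 = non-phishing.
p_w_z = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]
THRESHOLD = 0.5  # hypothetical decision threshold

p_z_new = fold_in([3, 1, 0, 0], p_w_z)
label = "phishing" if p_z_new[0] > THRESHOLD else "non-phishing"
```

Because only the topic mixture of the new document is updated, classifying a new email does not require retraining on the full corpus.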

Implementation



- To separate the good, the bad, and the suspicious, the module employs these functional components:
  - Content Parser: Parses content and removes noise.
  - Feature Extractor: Extracts a rich feature set for content classification.
  - Link Analyzer: Analyzes interdependency between email and web content (not done yet).
  - Content Classifier: Employs filters to separate the good, the bad, and the suspicious.
  - Fold-In Technique: Computes topic distribution probabilities of a new document.
  - Watch List Repository: External module that stores suspicious items for other components to use.

[Figure: Content Parser, Feature Extractor, Link Analyzer, Content Classifier (PLSA), Fold-In Technique, Watch List Repository.]


- A training corpus is used to build filters.
- Filters are built using PLSA.
- The folding-in technique is used to classify new content.

[Figure: filter design. Labels: Training Repository, Validation Repository, Filter Design (PLSA), Fold-In Technique, New Content Repository.]

Experimental Design and Results


Experimental Design

- Design
  - PhishingCorpus data set: 750 good emails, 250 phishing emails
  - K-fold cross-validation (k = 10)
  - Classification using PLSA
- Content Preprocessing
  - Removed extraneous characters and HTML tags.
  - Removed stop words.
  - Employed Porter's stemming.
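The 10-fold protocol can be sketched as follows. The slides do not say how the folds were drawn, so the shuffling, the fixed seed, and the function name `k_fold_indices` are assumptions made for illustration; the corpus size of 1,000 follows from the 750 good and 250 phishing emails.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 750 good + 250 phishing emails = 1000 documents in total.
splits = list(k_fold_indices(1000, k=10))
```

Each email appears in exactly one test fold, so every document is classified once by a model that never saw it during training.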

Experimental Design

- Content Parsing
  - Email data:
    - Implemented MIME message parsing.
    - Parsed email headers, email text, hyperlinks, and code.
    - Removed HTML tags.
- Feature Extraction
  - Email feature: words only
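The parsing and word-extraction steps might look like the sketch below, which uses only Python's standard `email` and `html.parser` modules. The stop-word list, the sample message, and the function name `parse_email` are hypothetical. Porter stemming is left out because the Python standard library has no stemmer.

```python
import re
from email import message_from_string
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text and href targets while dropping the HTML tags."""

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "your", "here"}


def parse_email(raw):
    """Return (headers, word tokens, hyperlinks) for one raw MIME message."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("From", "Reply-To", "Subject")}
    parser = TextAndLinks()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        parser.feed(payload.decode(charset, errors="replace"))
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.text).lower())
    tokens = [w for w in words if w not in STOP_WORDS]
    return headers, tokens, parser.links


# Hypothetical message using reserved example domains and addresses.
raw = (
    'From: "Credit Union" <service@example.com>\n'
    'Reply-To: "Credit Union" <other@example.net>\n'
    "Subject: Account update\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body>Please confirm your identity "
    '<a href="http://192.0.2.1/update.php">here</a></body></html>\n'
)
headers, tokens, links = parse_email(raw)
```

The hyperlinks are returned separately from the word tokens so that they remain available for later link analysis, even though the experiments here use words only.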

Experimental Design – Topic Identification (Training)

1. INPUT: Phishing Corpus training dataset
2. Preprocess (parse, remove stop words, stemming)
3. Build term-document frequency matrix
4. Initialize probabilities P(d), P(w|z), P(z|d)
5. Run TEM
6. OUTPUT: topics (z1 phishing, z2 non-phishing), P(w|z1), P(w|z2)

Experimental Design – Phishing Detection (Testing)

1. INPUT: Phishing Corpus test dataset
2. Preprocess (parse, remove stop words, stemming)
3. Build term-document frequency matrix
4. Fix z and P(w|z); initialize P(dnew), P(z|dnew)
5. Run TEM
6. Is P(z1|dnew) > threshold?
   - YES: PHISHING EMAIL
   - NO: NON-PHISHING EMAIL

Results – PLSA Topic Identification

Topic: Phishing             Topic: Non-Phishing
w            P(w|z)         w          P(w|z)
ebai         0.028786       video      0.000225
confirm      0.000352       game       0.000102
usernam      0.145218       playsta    0.000203
bill         0.000549       movi       0.000179
issue        0.000104       music      0.00022
account      0.012317       disnei     0.000226
password     0.013333       dvd        0.000223
failur       0.000405       ashle      0.000182
suspend      0.000523       simpson    0.000209
indefinit    0.000385       star       0.000223
thank        0.002679       war        0.000221
cooper       0.000302       truck      0.000205
click        0.00267        handbag    0.000171
compromise   0.000246       diesel     0.000218
…            ..

Results

[Figure: ROC Curve of Phishing Email Classification. x-axis: False Positive, ticks from 0 to 0.8; y-axis: True Positive, ticks from roughly 0.64 to 0.73.]
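For readers unfamiliar with how such a curve is produced, the sketch below computes ROC points by sweeping the decision threshold on P(z1|dnew). The scores, labels, and thresholds are made-up toy values and do not come from the experiments; the function name `roc_points` is hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        # A document is flagged as phishing when its score exceeds the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        points.append((fp / neg, tp / pos))
    return points


# Toy scores standing in for P(z1|d_new); label 1 = phishing, 0 = good email.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
points = roc_points(scores, labels, [0.0, 0.5, 1.0])
```

Lowering the threshold flags more emails as phishing, which raises the true positive rate and the false positive rate together.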

Results – Performance Comparison

- Accuracy:
  - SpamAssassin: 89%
  - PILFER (using SVM): 92%
  - PLSA: 71%
- Even though PLSA's accuracy is lower than that of the other techniques, PLSA handles different word usage and polysemy while the others do not.
- Additional features could not be used by PLSA due to the age of the dataset (phishing domains are short-lived).
- Performance improvement is expected when email data, website data, and their interdependencies from a recent data set are used for detection with PLSA.

Conclusions and Future Research

- Conclusions:
  - Results show that PLSA identifies hidden topics (phishing versus others).
  - 71% accuracy was achieved using a very limited feature set (words only) from email content.
- Future Research:
  - Expand the PLSA framework to include a richer feature set (email + website + interdependencies).
  - Employ a prediction modeler to predict future attacks from events generated by internal and external components.
  - Employ an adaptive modeler to adapt to changes in attack strategies and modes of operation.
  - Employ the framework in an autonomous environment for automatic identification, prevention, and protection of email servers from phishing attacks.


THANK YOU !