Phishing Detection Using Probabilistic Latent Semantic Analysis
Venkatesh Ramanathan & Dr. Harry Wechsler
Department of Computer Science, George Mason University, Fairfax, VA 22030
Quantitative Methods in Defense and National Security 2010
George Mason University, May 25-26, 2010
My name is Venkatesh Ramanathan. I am a student at George Mason University (GMU), and my advisor is Prof. Harry Wechsler. I am currently pursuing a Ph.D. in the Department of Computer Science, and my dissertation is in the area of phishing detection and prevention. My presentation today will focus on some initial work that I have done in this area.
OUTLINE
- Motivation and Goals
- Background
- Methodology
- Implementation
- Experimental Design and Results
- Conclusions and Future Research
My presentation is organized as follows.
Motivation and Goals
- In phishing, an attacker tricks users into divulging personal and financial data.
- More than 50,000 unique attacks are detected per month by the Anti-Phishing Working Group.
[Figure: fishing vs. phishing analogy, with the labels Attacker (fisherman, boat), Bait, and Victim (fish)]
Phishing is a criminal activity that uses “social engineering” techniques. Just as a fisherman in a boat trolls the river to catch fish, the attacker trolls the Internet to trick users into revealing personal and financial data. The bait the attacker uses is usually an email message with convincing content. APWG, a non-profit organization, reports more than 50,000 unique phishing attacks per month, and 75% of those attacks target financial institutions. The cost of phishing was reported to be 3.2 billion US dollars in 2007. In Q4 of 2009, there were 10 million infected computers. The cost of phishing will therefore only rise as we rely more and more on the Internet to conduct transactions. Yet it is infeasible for humans to recognize these baits, because they look authentic.
Goal and Method
- Goal:
  - Develop a self-managing computing system that detects and prevents phishing attacks, predicts future attacks, and automatically adapts to new attacks
- Presentation focus:
  - Development of a content filtering module to detect phishing attacks
- Module employs:
  - Probabilistic Latent Semantic Analysis (PLSA) for topic identification and classification
Background
Attacker’s Motivation & Mode of Operation
- Motivation
  - Exploit security holes and consumers’ trust to gather personal information
  - Use stolen information for financial benefit, identity theft, and other fraud
- Mode of Operation
  - Find a domain name that resembles the current target
  - Find similarities that would make the phisher’s site look legitimate
  - Register the domain
  - Find a reliable anonymous host (offshore)
  - Design a web page similar to the source web site, with added code to collect the user’s credentials and post them to the attacker’s server
Phishing Examples: Collect User Credentials
[Figure: attack flow among Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, and Real Web Site]
1. Attacker composes a “phish” email and sends it.
2. Mail server processes it and stores it in the victim’s account.
3. Victim’s PC fetches the “phish” email.
4. Victim opens the “phish” email.
5. Victim clicks on a link.
6. Link launches the browser and takes the victim to a phishing web site.
7. Victim, thinking it is the real web site, enters username/password.
8. Attacker collects the credentials, puts up an invalid-login page, redirects to the real web site, and takes down the phishing site.
9. Victim enters credentials again and logs into the real site.
Phishing Examples: Email To Collect User Credentials
Received: from 101.212.50.151 by ; Tue, 15 Nov 2005 16:27:51 -0700
From: "Credit Union" [email protected]    (annotation: PRETENDS TO BE FROM REAL ACCOUNT)
Reply-To: "Credit Union" [email protected]    (annotation: PHISHING EMAIL ADDRESS/DOMAIN; TWO DIFFERENT ADDRESSES)
To: [email protected], [email protected], [email protected]
Subject: FCU: Account update
Date: Tue, 15 Nov 2005 19:28:51 -0400
Content-Type: text/html;    (annotation: HTML EMAIL, EASIER TO DISGUISE/HIDE LINKS)
<html>
<head></head>
<body>
Credit Union is constantly working to ensure security by regularly screening the accounts in our system. We recently reviewed your account, and we need more information to help us provide you with secure service.    (annotation: CONVINCING CONTENT)
Until we can collect this information, your access to sensitive account features will be limited.    (annotation: RISK, IF NO ACTION)
We would like to restore your access as soon as possible, and we apologize for the inconvenience. <br>...
How can I restore my account access? Please confirm your identity here: Restore <a href=3D"http://200.73.81.212/.CREDIT-UNION/update.php">My Online Banking</a>    (annotation: PHISHING WEBSITE LINK)
and complete the "Steps to Remove Limitations." Completing all of the checklist items will automatically restore your account access.
</html>
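The cues annotated on this slide (a Reply-To address that differs from From, an HTML body, and a link that points at a raw IP address) can be checked mechanically. The following is only a minimal sketch using Python's standard `email` package; the message text, the `.example` addresses, and the `phishing_cues` helper are hypothetical stand-ins, not part of the system described in the talk.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical message modeled on the slide; the addresses are placeholders.
RAW_EMAIL = """\
From: "Credit Union" <service@creditunion.example>
Reply-To: "Credit Union" <collector@phisher.example>
Subject: FCU: Account update
Content-Type: text/html

<html><body>Please confirm your identity here:
<a href="http://200.73.81.212/.CREDIT-UNION/update.php">My Online Banking</a>
</body></html>
"""

# Matches href values whose host is a dotted IPv4 address.
IP_LINK = re.compile(r'href="?https?://\d{1,3}(?:\.\d{1,3}){3}[/:"]', re.I)

def phishing_cues(raw):
    """Return the three slide cues as booleans (illustrative helper)."""
    msg = message_from_string(raw)
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    reply_addr = parseaddr(msg.get("Reply-To", ""))[1].lower()
    body = msg.get_payload()  # a plain string for a non-multipart message
    return {
        "reply_to_differs": bool(reply_addr) and reply_addr != from_addr,
        "html_body": msg.get_content_type() == "text/html",
        "ip_address_link": bool(IP_LINK.search(body)),
    }

cues = phishing_cues(RAW_EMAIL)
```

Cues like these are hand-written rules; the talk's own approach instead learns topics from content, so this sketch only illustrates what the slide annotations point at.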
Protection Techniques
[Figure: the same email path (Attacker, SMTP Mail Server, Mail Database, IMAP Mail Server, Victim, Phishing Web Site, Real Web Site), annotated with the protection techniques below]
1. Network Level
2. Authentication
3. Filters/Classifiers
4. User Profile Filters, Toolbars
5. Mimic Prevention
6. User Education
Methodology
- Self-managing computing system that adapts to unpredictable changes whilst hiding intrinsic complexity from operators and users
- Self-Protection: System detects malicious activities and protects against such acts without human involvement
- Self-Optimization: System optimizes itself in response to changes in load and environment
- Self-Configuration: System configures itself upon initialization, restart, and changes in environment and load
- Self-Healing: System recovers automatically from hardware and software failures
[Figure: Autonomic Computing system with four properties: Self-Protection, Self-Optimization, Self-Configuration, Self-Healing]
System Architecture
[Figure: system architecture block diagram. Labels recovered from the slide:]
- Modules: Phishing Detection Module, Adaptive Module, Predictive Module, Optimization Module, Repair Module, Round Robin Module
- Engines: Self-Protection Engine, Self-Optimization Engine, Self-Configuration Engine, Self-Healing Engine
- Components: Content Parser, Feature Extractor, Link Analyzer, Content Classifier (PLSA), Fold-In Technique, Content Cache, Watch List, Watch List Monitor, Watch List Repository, Repository Updater, Web Crawler, Time Slice Control, Event Manager, Event Collector, Event Analyzer, Event Publisher, Prediction Modeler, New Attack Monitor, Workload Monitor, Performance Monitor, Software Updater, System Configuration, External Data Sources
Phishing Detection Module
- To detect phishing attacks, the module will:
  - employ a filtering model that separates good email from phishing email
  - store suspicious emails for use by other modules
  - employ PLSA, a natural language processing technique
[Figure: filter sorting email into GOOD EMAIL, PHISHING EMAIL, and SUSPICIOUS (stored in the WATCHLIST)]
Phishing Detection Module
- PLSA makes use of “context” and term co-occurrences.
- Handles topics containing polysemous words
  - words that mean different things in different contexts
  - Bank: river bank vs. financial banking system
- Based on the likelihood principle and has a solid statistical foundation.
- Standard statistical techniques can be applied for model fitting.
PLSA Introduction
- PLSA:
  - Given:
    - a set of documents, D = {d1, d2, …}
  - Find:
    - latent (hidden) topics, Z = {z1, z2, …}
  - From:
    - vocabulary, W = {w1, w2, …}
- Topic/concept probabilities are estimated based on all documents that deal with a concept.
[Figure: latent topics linking Documents and Terms (example term: Username)]
- W = {w1, …, wj, f1, …, fj}; wi: words; fi: features
- Step 2: Initialize probabilities P(z), P(d|z), P(w|z) randomly.
[Figure: Document-Term Matrix. Rows are the documents d1, …, di, …, dI (set D); columns are the terms w1, …, wj, f1, …, fJ (set W).]
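As an illustration of the document-term matrix and the random initialization in Step 2, here is a minimal pure-Python sketch. The token lists, the function names, and the choice of two topics are assumptions made for the example; non-word features fi would simply appear as additional columns.

```python
import random
from collections import Counter

def build_document_term_matrix(docs):
    """Return (vocabulary, counts) with counts[i][j] = n(d_i, w_j)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, n in Counter(doc).items():
            counts[i][index[term]] = n
    return vocab, counts

def random_distribution(size, rng):
    """Random strictly positive vector that sums to 1."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def init_plsa(n_docs, n_terms, n_topics, seed=0):
    """Step 2: random P(z), P(d|z), P(w|z); one row per topic."""
    rng = random.Random(seed)
    p_z = random_distribution(n_topics, rng)
    p_d_z = [random_distribution(n_docs, rng) for _ in range(n_topics)]
    p_w_z = [random_distribution(n_terms, rng) for _ in range(n_topics)]
    return p_z, p_d_z, p_w_z

# Hypothetical, already stemmed token lists standing in for parsed emails.
docs = [
    ["verifi", "account", "password", "account"],
    ["meet", "agenda", "lunch"],
    ["account", "limit", "restor", "password"],
]
vocab, counts = build_document_term_matrix(docs)
p_z, p_d_z, p_w_z = init_plsa(len(docs), len(vocab), n_topics=2)
```

Each row of `p_d_z` and `p_w_z` is a probability distribution conditioned on one topic, which is the form the EM updates expect.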
PLSA: EM Algorithm
- Step 3: E-step: compute posterior probabilities for the latent variables
- Step 4: M-step: maximize the expected complete data log-likelihood
- Step 5: Compute log likelihood
- Step 6: Iterate until desired threshold
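The slide's equations did not survive in the transcript. For reference, the standard PLSA updates for the parameterization initialized in Step 2, P(z), P(d|z), P(w|z), are written out below, where n(d,w) is the entry of the document-term matrix. This assumes the talk used the usual formulation from the PLSA literature.

```latex
\begin{align*}
\text{E-step:}\quad P(z \mid d, w) &= \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')} \\
\text{M-step:}\quad P(w \mid z) &= \frac{\sum_{d} n(d,w)\,P(z \mid d,w)}{\sum_{d}\sum_{w'} n(d,w')\,P(z \mid d,w')} \\
P(d \mid z) &= \frac{\sum_{w} n(d,w)\,P(z \mid d,w)}{\sum_{d'}\sum_{w} n(d',w)\,P(z \mid d',w)} \\
P(z) &= \frac{\sum_{d}\sum_{w} n(d,w)\,P(z \mid d,w)}{\sum_{d}\sum_{w} n(d,w)} \\
\text{Log-likelihood:}\quad \mathcal{L} &= \sum_{d}\sum_{w} n(d,w)\,\log \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)
\end{align*}
```

Steps 3 and 4 are repeated until the change in the log-likelihood falls below the chosen threshold (Steps 5 and 6).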
PLSA: Folding-In Technique
- Estimate probability on ‘unseen’ document (dnew).
- P(w|z) unchanged. Use estimates from training.
- E-step: [equation omitted in transcript]
- M-step: [equation omitted in transcript]
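The E-step and M-step formulas are missing from the transcript. In the usual folding-in formulation, only the topic mixture of the new document, P(z | d_new), is re-estimated while P(w | z) stays at its trained value. The equations below assume that standard form, with P(z | d_new) playing the role that P(z)P(d | z) plays during training, up to normalization.

```latex
\begin{align*}
\text{E-step:}\quad P(z \mid d_{new}, w) &= \frac{P(w \mid z)\,P(z \mid d_{new})}{\sum_{z'} P(w \mid z')\,P(z' \mid d_{new})} \\
\text{M-step:}\quad P(z \mid d_{new}) &= \frac{\sum_{w} n(d_{new}, w)\,P(z \mid d_{new}, w)}{\sum_{w} n(d_{new}, w)}
\end{align*}
```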
Implementation
Phishing Detection Module
- To separate the good, the bad, and the suspicious, the module employs these functional components:
  - Content Parser: Parses content and removes noise.
  - Feature Extractor: Extracts a rich feature set for content classification.
  - Link Analyzer: Analyzes the interdependency between email and web content (not done yet).
  - Content Classifier: Employs filters to separate the good, the bad, and the suspicious.
  - Fold-In Technique: Computes topic distribution probabilities of a new document.
  - Watch List Repository: External module that stores suspicious items for use by other components.
[Figure: Content Parser, Feature Extractor, Link Analyzer, Content Classifier (PLSA), Fold-In Technique, Watch List Repository]
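The Fold-In Technique and the three-way split into good, bad, and suspicious could be sketched as follows. This is a minimal illustration, assuming a trained P(w|z) is available, that topic 0 has been identified as the phishing topic, and that the `low`/`high` thresholds are free parameters; none of these specifics come from the talk.

```python
def fold_in(term_counts, p_w_z, n_iter=50):
    """Estimate P(z | d_new) for one unseen document, keeping P(w | z) fixed.

    term_counts[j] = n(d_new, w_j), aligned with the training vocabulary.
    p_w_z[k][j]    = P(w_j | z_k) estimated during training.
    """
    n_topics = len(p_w_z)
    p_z_d = [1.0 / n_topics] * n_topics  # start from a uniform mixture
    for _ in range(n_iter):
        expected = [0.0] * n_topics
        for j, n in enumerate(term_counts):
            if n == 0:
                continue
            # E-step: posterior over topics for this term.
            joint = [p_w_z[k][j] * p_z_d[k] for k in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue
            for k in range(n_topics):
                expected[k] += n * joint[k] / norm
        total = sum(expected)
        if total == 0.0:
            break  # no usable terms: keep the current mixture
        # M-step: renormalize the expected counts.
        p_z_d = [value / total for value in expected]
    return p_z_d

def triage(p_phish, low=0.3, high=0.7):
    """Map a phishing-topic probability to one of three labels (hypothetical thresholds)."""
    if p_phish >= high:
        return "phishing"
    if p_phish <= low:
        return "good"
    return "suspicious"

# Toy model: vocabulary = [account, password, meeting, lunch]; topic 0 is phishing-like.
p_w_z = [[0.5, 0.4, 0.05, 0.05],
         [0.05, 0.05, 0.5, 0.4]]
new_docs = ([2, 1, 0, 0], [0, 0, 1, 2], [1, 0, 1, 0])
labels = [triage(fold_in(counts, p_w_z)[0]) for counts in new_docs]
```

Items labeled "suspicious" are the ones that would be handed to the Watch List Repository in the architecture above.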
Phishing Detection Module
- Training corpus is used to build filters.
- Filters are built using PLSA.
- Folding-In technique is used to classify new content.
[Figure: filter design flow with Training Repository, Validation Repository, New Content Repository, Filter Design (PLSA), and Fold-In Technique]
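Building a filter from a training corpus amounts to fitting the PLSA parameters with EM. The sketch below is a small pure-Python version written for readability rather than speed; the function name, the default iteration limit and tolerance, and the toy count matrix are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def train_plsa(counts, n_topics, max_iter=100, tol=1e-6, seed=0):
    """Fit P(z), P(d|z), P(w|z) to counts[i][j] = n(d_i, w_j) with EM."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def rand_dist(size):
        weights = [rng.random() + 1e-3 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    p_z = rand_dist(n_topics)
    p_d_z = [rand_dist(n_docs) for _ in range(n_topics)]
    p_w_z = [rand_dist(n_terms) for _ in range(n_topics)]
    history = []  # log-likelihood of the parameters entering each iteration

    for _ in range(max_iter):
        acc_z = [0.0] * n_topics
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_terms for _ in range(n_topics)]
        log_lik = 0.0
        for i in range(n_docs):
            for j in range(n_terms):
                n = counts[i][j]
                if n == 0:
                    continue
                # E-step: P(z | d_i, w_j) is joint[k] / norm.
                joint = [p_z[k] * p_d_z[k][i] * p_w_z[k][j] for k in range(n_topics)]
                norm = sum(joint)
                log_lik += n * math.log(norm)
                for k in range(n_topics):
                    resp = n * joint[k] / norm
                    acc_z[k] += resp
                    acc_d[k][i] += resp
                    acc_w[k][j] += resp
        history.append(log_lik)
        # M-step: normalize the expected counts.
        total = sum(acc_z)
        for k in range(n_topics):
            p_z[k] = acc_z[k] / total
            if acc_z[k] > 0:
                p_d_z[k] = [v / acc_z[k] for v in acc_d[k]]
                p_w_z[k] = [v / acc_z[k] for v in acc_w[k]]
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return p_z, p_d_z, p_w_z, history

# Toy corpus: two documents use terms 0-1, two use terms 2-3.
toy_counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
p_z, p_d_z, p_w_z, history = train_plsa(toy_counts, n_topics=2)
```

The returned `p_w_z` is what a folding-in step would hold fixed when new content arrives; which topic corresponds to phishing still has to be determined from labeled training documents.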
Experimental Design and Results
Experimental Design
- Design
  - PhishingCorpus data set: 750 good emails, 250 phishing emails
  - K-fold cross validation (k = 10)
  - Classification using PLSA
- Content Pre-Processing
  - Removed extraneous characters and HTML tags.
  - Removed stop words.
  - Employed Porter’s stemming.
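The 10-fold cross-validation over the 1,000 emails can be sketched as an index split. Whether the original experiment shuffled or stratified the folds is not stated, so the plain seeded shuffle below is an assumption, as is the helper name.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal slices
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 750 good + 250 phishing emails = 1000 samples.
splits = list(k_fold_indices(1000, k=10))
```

Each email appears in exactly one test fold, so every fold trains on 900 emails and evaluates on the remaining 100.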
Experimental Design
- Content Parsing
  - Email data:
    - Implemented MIME message parsing.
    - Parsed email headers, email text, hyperlinks, and code.
    - Removed HTML tags.
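The parsing steps listed above might look like the following with Python's standard library. This is a sketch under stated assumptions: the sample message, the tiny stop-word list, and the `parse_email` helper are hypothetical, and Porter stemming is left out because it is not available in the standard library.

```python
import re
from email import message_from_string
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "the", "to", "your", "is", "of", "here"}  # tiny illustrative list

class _TextAndLinks(HTMLParser):
    """Collects visible text and href targets while dropping the tags."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)

def parse_email(raw):
    """Return (headers, tokens, links) for one raw MIME message."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("From", "Reply-To", "Subject") if msg[name]}
    extractor = _TextAndLinks()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        extractor.feed(payload.decode(charset, errors="replace"))
    extractor.close()  # flush any buffered trailing text
    words = re.findall(r"[a-z]+", " ".join(extractor.text).lower())
    tokens = [w for w in words if w not in STOP_WORDS]
    return headers, tokens, extractor.links

RAW = """\
From: "Credit Union" <service@creditunion.example>
Subject: FCU: Account update
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<html><body><p>Please confirm your identity here:</p>
<a href="http://200.73.81.212/.CREDIT-UNION/update.php">My Online Banking</a>
</body></html>
"""
headers, tokens, links = parse_email(RAW)
```

The resulting token list is what would feed the document-term matrix, and the extracted links are the raw material for a link analysis step.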
- Even though PLSA’s accuracy is lower than that of other techniques, PLSA handles different word usage and polysemy while the others do not.
- Additional features could not be used by PLSA due to the age of the data set (phishing domains are short-lived).
- A performance improvement is expected when email data, website data, and their interdependencies from a recent data set are used for detection with PLSA.
Conclusions and Future Research
- Conclusions:
  - Results show that PLSA identifies hidden topics (phishing versus others).
  - An accuracy of 71% was achieved using a very limited feature set (words only) from email content.
- Future Research:
  - Expand the PLSA framework to include a richer feature set (email + website + interdependencies).
  - Employ a prediction modeler to predict future attacks from events generated by internal and external components.
  - Employ an adaptive modeler to adapt to changes in attack strategies and modes of operation.
  - Employ the framework in an autonomous environment for automatic identification, prevention, and protection of email servers from phishing attacks.
References
1) Ahmed Abbasi and Hsinchun Chen. A comparison of tools for detecting fake web sites, Research Feature, Computer, IEEE Computer Society, 2009.
2) Anti-Phishing Working Group (APWG): http://www.antiphishing.org
3) Arel Cordero, Tamara Blain. Catching Phish: Detecting Phishing Attacks From Rendered website Images, available at http://www.cs.berkeley.edu/~asimma/294-fall06/projects/reports/cordero.pdf.
4) Autonomic Computing, http://en.wikipedia.org/wiki/Autonomic_Computing
5) Financial Service Technology Consortium (FSTC): North-America based financial institutions, technology vendors, independent research organizations and government agency, available at http://www.fstc.org/projects/docs/FSTC_Counter_Phishing_Project_Whitepaper.pdf
6) Hasika Pamunuwa, et al. An Intrusion Detection System for Detecting Phishing Attacks, LNCS, September 2007.
7) Ian Fette, Norman Sadeh, Anthony Tomasic. Learning to Detect Phishing Emails, to appear in WWW 2007, available at http://www.cs.cmu.edu/~tomasic/doc/2007/FetteSadehTomasicWWW2007.pdf
8) Liu Wenyin, Guanglin Huang, Liu Xiaoyue, Zhang Min, Xiaotie Deng. Detection of Phishing Webpages based on Visual Similarity, in WWW 2005, May 10-14, 2005, Chiba, Japan.
9) Neil Chou, Robert Ledesma, Yuka Teraguchi, Dan Boneh, John C. Mitchell. Client-side defense against web-based identity theft (Webspoof), available at http://www.crypto.stanford.edu/SpoofGuard/webspoof.pdf.
10) Niels Provos. A Virtual Honeypot Framework, available at http://www.niels.xtdnet.nl/papers/honeyd.pdf.
11) Nicolas Vanderavero, Xavier Brouckaert, Olivier Bonaventure, Baudouin Le Charlier. The HoneyTank: A Scalable Approach to collect malicious Internet Traffic, in International Infrastructure Survivability Workshop (IISW’04) 2004, held in conjunction with the 25th IEEE International Real-Time Systems Symposium (RTSS04). Paper available at http://www.info.ucl.ac.be/people/OBO/papers/honeytank.pdf.
12) Thomas Hofmann. Probabilistic Latent Semantic Indexing, SIGIR, 1999.
13) Ulrike von Luxburg. A Tutorial on Spectral Clustering, Statistics and Computing, 2007.
14) Yu Chen, Wei-Ying Ma, Hong-Jiang Zhang. Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices, in WWW 2003, May 20-24, 2003.
15) Yue Zhang, Jason Hong, Lorrie Cranor. CANTINA: A Content Based Approach to Detecting Phishing Sites, to appear in WWW 2007 and available at www.cups.cs.cmu.edu/trust.php.