1 CANTINA : A Content-Based Approach to Detecting Phishing Web Sites WWW 2007 2008.09.09 Yue Zhang, Jason Hong, and Lorrie Cranor.

CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

WWW 2007

2008.09.09

Yue Zhang, Jason Hong, and Lorrie Cranor

CS710 | KAIST

Agenda

Phishing Attacks
Motivation & Goal
Related Work
CANTINA
Evaluation
Conclusion

Phishing Attacks (1/2)

The act of stealing personal information via the Internet for the purpose of committing financial fraud
Attackers create a fake site that resembles a legitimate site, such as a bank
The fake site is delivered to users by various methods
• Spam e-mail, XSS vulnerabilities, malware …

Technical issues
URL obfuscation
• Similar domain, URL encoding …
DNS hijacking
• Modifying the hosts file, DNS server settings …
Malware
• BHO (Browser Helper Object), browser toolbar, key logger …

Phishing Attacks (2/2)

Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
• Similar to the original web site

Phishing pages often contain brand names and other terms that are common on a given web page
• Owner’s brands

Motivation & Goal

Phishing is a rapidly growing problem, with 9,255 unique phishing sites reported in 2006

84 anti-phishing toolbars, with low accuracy
There is a strong need for better automated detection algorithms

A novel content-based approach for detecting phishing web sites
Achieve higher accuracy than existing approaches

Related Work (1/3)

Anti-phishing research falls into four categories
Why people fall for phishing attacks
• Studies that examine the reasons people fall for phishing attacks
Educating people about phishing attacks
• Focused on online training materials, testing, and situated learning
Anti-phishing user interfaces
• Focused on the development of better user interfaces for anti-phishing tools
Automated detection of phishing

Related Work (2/3)

Anti-phishing user interfaces
Toolbar-based approach
Browser extensions
• Dynamic Security Skins
• Web Wallet

Related Work (3/3)

Automated detection of phishing
Use heuristics to judge whether a page has phishing characteristics
• Host name, domain name, URLs, …
Use a blacklist that lists reported phishing URLs

CANTINA | Basic Concept

Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
Phishing pages therefore contain the brand names and terms of the legitimate pages

Robust Hyperlinks
A technique for recovering from broken links
Adds a lexical signature to URLs
• If the link doesn’t work, feed the signature to a search engine
• Ex. http://aaa.com/a.html?lexical-signature=“word1+word2+...+word5”
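As a rough illustration of the Robust Hyperlinks idea, the Python sketch below appends a lexical signature to a URL. The helper name `add_lexical_signature` and the sample terms are assumptions made for this example; only the `lexical-signature` parameter name is taken from the slide.

```python
from urllib.parse import urlsplit, urlunsplit


def add_lexical_signature(url, terms):
    """Return `url` with a lexical-signature query parameter appended.

    Hypothetical helper: the terms are joined with '+' as in the slide's example.
    """
    parts = urlsplit(url)
    param = 'lexical-signature="' + "+".join(terms) + '"'
    # Keep any query string the URL already carries.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


signed = add_lexical_signature(
    "http://aaa.com/a.html", ["word1", "word2", "word3", "word4", "word5"]
)
print(signed)
```

If the original address later stops resolving, the terms stored in the signature can be sent to a search engine to relocate the page.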

TF/IDF (Term Frequency/Inverse Document Frequency)
A frequency-based algorithm, and a basic algorithm for search engines
• Used for comparing and classifying documents
• A term has a high TF-IDF weight when it has a high term frequency in a given document and a low document frequency across the whole collection
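The weighting can be sketched in a few lines of Python. This is a minimal illustration using one common TF-IDF variant (relative term frequency times the log of inverse document frequency); the toy corpus, the function names, and the exact formula are assumptions, not necessarily what CANTINA itself uses.

```python
import math
from collections import Counter


def tf_idf_weights(doc, corpus):
    """Map each term of `doc` (a list of tokens) to its TF-IDF weight.

    `corpus` is a list of token lists and is assumed to include `doc` itself.
    """
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)  # term frequency in this document
        df = sum(1 for other in corpus if term in other)  # documents containing the term
        idf = math.log(len(corpus) / max(df, 1))  # rarer across the corpus => larger
        weights[term] = tf * idf
    return weights


def top_terms(doc, corpus, k=5):
    """Return the k terms with the highest weight (ties broken alphabetically)."""
    weights = tf_idf_weights(doc, corpus)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


# Toy corpus, invented for illustration only.
corpus = [
    ["ebay", "sign", "in", "the", "user"],
    ["the", "news", "the", "weather"],
    ["the", "bank", "user"],
]
print(top_terms(corpus[0], corpus, k=3))
```

A word such as "the", which appears in every document, gets a weight of zero, while a word that appears only in the first document ranks at the top.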

Web page

Calculate the TF-IDF weight of each term

Take the five terms with the highest TF-IDF weight

Search for the top five terms (term1+term2..) using Google

Compare the domain name of the page with the Google search results

Phishing site: the domain name of the current page does not match the domain name of any of the top N search results (N = 30)
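The flow above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `search` is a hypothetical callable standing in for a real search engine query, host names are compared exactly after stripping a leading `www.`, and an empty result list is treated as "no verdict", which is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of `url`, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def basic_cantina(page_url, signature, search, top_n=30):
    """Return True if the page looks like phishing under the basic check.

    `search` is a hypothetical callable: query string -> list of result URLs.
    """
    results = search(" ".join(signature))[:top_n]
    if not results:
        return False  # assumption: no results means no verdict in this basic sketch
    page_host = host_of(page_url)
    # Phishing if none of the top results is served from the page's own host.
    return all(host_of(url) != page_host for url in results)


def fake_search(query):
    # Stand-in for a real search engine call; always returns the same results.
    return ["http://www.ebay.com/", "http://signin.ebay.com/help"]


signature = ["eBay", "user", "sign", "help", "forgot"]
print(basic_cantina("http://ebay.example.net/login", signature, fake_search))
print(basic_cantina("http://www.ebay.com/", signature, fake_search))
```

Passing the search function in as a parameter keeps the sketch runnable offline; a real deployment would need a proper comparison of registered domains rather than exact host names.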

Faked Page

TF/IDF Top 5: eBay, user, sign, help, forgot

Real Page

TF/IDF Top 5: eBay, user, sign, help, forgot

CANTINA | Additional Solutions

Basic CANTINA produces a number of false positives

Solutions
Add the current domain name to the lexical signature
ZMP (Zero results Means Phishing)
• If Google returns zero search results, label the page as phishing
– Meaningless domain (e.g., “u-s-j.be”)
Larger set of heuristics based on related work
• Drawn from existing approaches (e.g., SpoofGuard, PILFER)
• Age of Domain, Known Images, Suspicious URL, …
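The first two refinements can be sketched together. In this hedged example the function names `build_query` and `verdict` are made up for illustration, and domains are passed in as plain strings rather than extracted from live search results.

```python
def build_query(signature, domain=None):
    """Join the signature terms into a search query, optionally adding the domain."""
    terms = list(signature)
    if domain:
        terms.append(domain)
    return " ".join(terms)


def verdict(page_domain, result_domains, zmp=True):
    """Return True for phishing.

    With ZMP enabled, an empty result list is itself treated as phishing.
    """
    if not result_domains:
        return zmp
    return page_domain not in result_domains


print(build_query(["ebay", "user", "sign", "help", "forgot"], "u-s-j.be"))
print(verdict("u-s-j.be", []))
print(verdict("ebay.com", ["ebay.com", "paypal.com"]))
```

The two refinements plausibly work together: adding a legitimate domain to the query makes the legitimate site more likely to appear in the results, while a meaningless domain tends to produce no results at all, which ZMP then flags.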

Evaluation | Effectiveness #1 (1/2)

Four conditions
• Basic TF-IDF
• Basic TF-IDF + domain name
• Basic TF-IDF + ZMP
• Basic TF-IDF + domain + ZMP

100 phishing URLs and 100 legitimate URLs
• Phishing URLs: PhishTank.com
• Legitimate URLs: from a previous study

Basic TF-IDF + ZMP + domain
• False positives are a little high
• Referred to as Final TF-IDF

Want to reduce false positives
Combine several heuristic methods

Determining the best weights for these heuristics is a typical classification problem
Use a simple forward linear model
Used 100 phishing URLs and 100 legitimate URLs to find the weights
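A weighted linear combination of heuristics can be sketched as below. The vote convention (+1 for phishing, -1 for legitimate), the threshold at zero, and the weight values are all assumptions chosen for illustration; they are not the weights learned in the study.

```python
def combined_verdict(votes, weights):
    """Weighted vote over heuristics.

    `votes` maps heuristic name -> +1 (looks like phishing) or -1 (looks legitimate).
    `weights` maps heuristic name -> non-negative weight (hypothetical values).
    """
    score = sum(weights[name] * vote for name, vote in votes.items())
    return "phishing" if score > 0 else "legitimate"


weights = {"tf_idf": 3, "age_of_domain": 1, "suspicious_url": 1}  # illustrative only
votes = {"tf_idf": +1, "age_of_domain": -1, "suspicious_url": -1}
print(combined_verdict(votes, weights))
```

With these made-up weights the TF-IDF vote outweighs the other two heuristics combined, so a single strong signal can decide the outcome. Fitting the weights on labeled URLs is what turns this into a classification problem.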

Compare the effectiveness of Final-TF-IDF, Final-TF-IDF+heuristics, SpoofGuard, and Netcraft
SpoofGuard: the highest true positive rate
• Relies entirely on heuristics
Netcraft: one of the best toolbars overall
• Uses a combination of heuristics and an extensive blacklist

100 phishing URLs from PhishTank.com
100 legitimate URLs
• 35 sites often attacked (Citibank, PayPal)
• 35 top pages from Alexa (most popular sites)
• 30 random web pages from random.yahoo.com

Reduced false positives from 6% to 1% by combining Final-TF-IDF with simple heuristics
However, the true positive rate also decreased
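For readers unfamiliar with the two metrics being traded off here, the sketch below shows how true positive and false positive rates are computed from labeled predictions. The eight-item sample is invented for illustration and is not the study's data.

```python
def detection_rates(predicted, actual):
    """Return (true_positive_rate, false_positive_rate).

    Both arguments are lists of booleans where True means phishing.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives


# Toy labels: four phishing pages followed by four legitimate pages.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
print(detection_rates(predicted, actual))
```

A stricter detector flags fewer legitimate pages (lower false positive rate) but usually also misses more phishing pages (lower true positive rate), which is the trade-off reported on this slide.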

Discussion

Limitations
Does not apply to non-English web sites
System performance
• Depends on the performance of the Google search engine

Possible attacks by criminals
Use images instead of words
Add invisible text
Circumvent TF-IDF and PageRank
• Using “Google Bombs”
Attempt a DoS attack on Google

Conclusion

CANTINA uses TF-IDF + search engines + heuristics to find phishing web sites
• 97% true positives with 6% false positives
• 89% true positives with 1% false positives

Shifts the problem of identifying phishing sites to a search engine problem

Q&A
