Typo-Squatting: a Nuisance or a Threat to Your Traffic? Mishari Almishari
Dec 21, 2015
Outline
• Introduction
• Background
• Methodology
• Parked Domain Classifier
• Measurements
• Future Work
• Related Work
• Conclusion
Introduction - Motivation
Traffic is important to web domains!
• No point in launching without incoming traffic
• Losing/gaining traffic means losing/gaining money
• One way to price ads is the pay-per-click model
Traffic diversion could be a serious threat to a domain
Introduction - Motivation
Typos may attract traffic
• Users are prone to making typos
• Users may forget about visiting the target domain
• Threat to the target domain!
Intentionally registering such typo domains is called typo-squatting
Introduction - Goal
To study how much traffic typo-squatters can get from target domains
• Are those domains attracting much traffic?
• There are many typo-squatting domains registered (Banerjee et al., 08)
• Search-engine typo corrections and browser auto-completions!
• How much traffic are target domains losing?
• Is it a negligible ratio or a serious threat?
• Do users go back to target domains or get distracted?
Introduction - Contribution
Automatic and accurate identification of typo-squatting domains (Measurement Methodology)
Bound on how much traffic target domains are losing to typo-squatting domains (Measurement Results)
Background – Domain Parking
Domain Parking is the practice of showing a temporary page for an unused domain before launching it
Background – Domain Parking
Domain parking service
• Parks and hosts unused domains
• Monetizes the traffic by showing ads
Many typo-squatting domains are parked domains (Wang et al., 06), (Keats, 07)
Methodology
• Data Collection
• Identifying Typo-Squatting Domains
Methodology - Data Collection
[Figure: data-collection setup. Labels: user query, UCI network, UCI resolver, Internet, our machine.]
Logged record fields: DATE, TIME, HASHED-IP, DOMAIN, TYPE, CLASS
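A minimal sketch of reading one logged record, assuming the six fields above appear in that order separated by whitespace. The slides do not give the actual on-disk format, so the field layout, the `DnsQuery` type, and the sample line are all hypothetical.

```python
from typing import NamedTuple


class DnsQuery(NamedTuple):
    """One logged DNS query (field names follow the slide; layout is assumed)."""
    date: str
    time: str
    hashed_ip: str
    domain: str
    qtype: str
    qclass: str


def parse_record(line: str) -> DnsQuery:
    """Split one whitespace-separated log line into its six fields."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    date, time, hashed_ip, domain, qtype, qclass = parts
    # DNS names are case-insensitive; normalize case and drop a trailing root dot.
    return DnsQuery(date, time, hashed_ip, domain.lower().rstrip("."), qtype, qclass)


# Made-up sample line, for illustration only.
record = parse_record("2008-01-15 10:32:07 a3f9c2 WWW.Example.com. A IN")
print(record.domain, record.qtype)
```

Normalizing the domain at parse time keeps later comparisons against target domains simple.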
Methodology – Identify Typo-squatting Domain
Identify similar domains
a. Single-error typo
• A single error accounts for 90-95% of spelling/typo errors (Pollock et al., 83)
• www.walmart.com and www.wamart.com
b. gTLD substitution
• www.amazon.com and www.amazon.org
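The two similarity rules can be sketched as follows. This is an illustration, not the authors' implementation: it assumes a "single error" means one insertion, deletion, substitution, or adjacent transposition, and the `GTLDS` set is a small assumed sample of generic TLDs.

```python
# Assumed sample of generic TLDs; the slides do not list which ones were used.
GTLDS = {"com", "org", "net", "info", "biz"}


def is_single_error(a: str, b: str) -> bool:
    """True if b differs from a by exactly one insertion, deletion,
    substitution, or transposition of two adjacent characters."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # one substitution
        if len(diffs) == 2:
            i, j = diffs
            # adjacent transposition, e.g. "amazon" vs "amzaon"
            return j == i + 1 and a[i] == b[j] and a[j] == b[i]
        return False
    # Lengths differ by one: removing one character from the longer
    # string must yield the shorter one.
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    return longer[i + 1:] == shorter[i:]


def split_domain(host: str):
    """Split 'www.walmart.com' into ('www.walmart', 'com')."""
    name, _, tld = host.lower().rpartition(".")
    return name, tld


def is_similar(target: str, candidate: str) -> bool:
    """Rule (a): same TLD and a single-error name; rule (b): same name, other gTLD."""
    t_name, t_tld = split_domain(target)
    c_name, c_tld = split_domain(candidate)
    if t_tld == c_tld:
        return is_single_error(t_name, c_name)
    return t_name == c_name and t_tld in GTLDS and c_tld in GTLDS


print(is_similar("www.walmart.com", "www.wamart.com"))
print(is_similar("www.amazon.com", "www.amazon.org"))
```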
Methodology – Identify Typo-squatting Domains
But a similar domain is not enough!
• www.abc.com and www.abd.com
• www.walmart.com and www.walkmart.com
• www.usps.com and www.usps.org
• Random sample: more than 54% are not typo-squatting
Need to identify hijacking intention
Methodology – Identify Typo-squatting Domain
Identify hijacking indicator
• Parked domain (ads listing): ~88%
• Forwarding to other domains: ~8%
• Others: inappropriate content, …
Parked domain as the indicator
Methodology – Identify Typo-squatting Domain
Similar domain + parked domain = typo-squatting domain
Methodology – Identify Typo-squatting Domain
How to identify a parked domain?
• Parked domain classifier
• 96%
• Presence of parking signatures
• Well-known parking signatures (domain names/URLs)
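A signature check of this kind can be sketched as below. The signature hostnames are placeholders under the reserved `.example` TLD, not the well-known signatures the study used, and the regex-based link extraction is a simplification chosen for brevity.

```python
import re

# Hypothetical signature list: placeholder hostnames standing in for
# parking-service domains. The real list is not given in the slides.
PARKING_SIGNATURES = ("parking-service.example", "ads.parked.example")

# Pull the host part out of href/src attributes that hold absolute URLs.
URL_HOST = re.compile(
    r"""(?:href|src)\s*=\s*["']https?://([^/"'\s:]+)""", re.IGNORECASE
)


def has_parking_signature(html: str, signatures=PARKING_SIGNATURES) -> bool:
    """True if any linked host equals a signature or is a subdomain of one."""
    hosts = {h.lower() for h in URL_HOST.findall(html)}
    return any(
        host == sig or host.endswith("." + sig)
        for host in hosts
        for sig in signatures
    )


parked_page = '<a href="http://www.parking-service.example/lander?d=wamart.com">Ads</a>'
normal_page = '<a href="https://www.example.org/about">About</a>'
print(has_parking_signature(parked_page))
print(has_parking_signature(normal_page))
```

Matching on whole host labels (exact match or a `.`-prefixed suffix) avoids flagging unrelated hosts that merely contain a signature as a substring.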
Methodology - Summary
Identify similar domains → identify parked domains → list of typo-squatting domains
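The two-stage filter can be sketched as a small function that takes the similarity test and the parked-domain test as parameters. Both predicates in the demo are toy stand-ins (a one-substitution check and a hard-coded set), used only to make the sketch runnable; all domain names in it are illustrative.

```python
from typing import Callable, Iterable, List


def find_typo_squatters(
    target: str,
    candidates: Iterable[str],
    is_similar: Callable[[str, str], bool],
    is_parked: Callable[[str], bool],
) -> List[str]:
    """Keep candidates that are similar to the target AND parked."""
    return [
        d for d in candidates
        if d != target and is_similar(target, d) and is_parked(d)
    ]


# Toy stand-in for the similarity step: exactly one substituted character.
def one_substitution(target: str, candidate: str) -> bool:
    return (len(target) == len(candidate)
            and sum(x != y for x, y in zip(target, candidate)) == 1)


# Toy stand-in for the parked-domain classifier: a hard-coded lookup.
parked = {"www.walmsrt.com", "www.unrelated.com"}

observed = ["www.walmart.com", "www.walmsrt.com", "www.walmarx.com", "www.unrelated.com"]
print(find_typo_squatters("www.walmart.com", observed, one_substitution, parked.__contains__))
```

Running the cheap similarity test first means the more expensive parked-domain check only sees the few candidates that survive it.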
Parked Domain Classifier
Build Data Set
Extract Core Features
Combine Into Classifier
Data Set
Data set consists of 2,800 domains
700 are parked domains
• Collected from the MS Strider website
2,100 are non-parked domains
• Collected from the fourteen Yahoo Directory top categories
Feature Selection
• Heuristically identify common features of parked domains
• Compute the distribution of those features for verification
• Common Link Ratio Max
Combining Features Into Classifier
Tried different classifier algorithms
• Decision Tree
• SVM
• K-Nearest Neighbor
• Random Forest (the best performance)
Data Sets
DNS traces
• Four months
• ~30 million domains (~2 billion hits) (~30,000 users)
Target domain set
• Alexa’s top 500 popular domains
• ~53,000,000 hits
Typo-Squatting Domains & Hits
1,332 typo-squatting domains, 13,431 hits (~110 a day)
Is it large or small?
• 500 target domains
• 4-month period
• ~30,000 users
• At a similar ratio, a larger user population may translate to a non-trivial number
• 30,000 users => 110 hits per day
• 300,000 users => 1,100 hits per day
• 3,000,000 users => 11,000 hits per day (× 365 = ~4,000,000 a year)
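The scaling argument is plain proportional arithmetic and can be reproduced as below. The only assumption is that four months is about 122 days; the slides round the results.

```python
SQUAT_HITS = 13_431        # typo-squatting hits observed in the trace
TRACE_DAYS = 4 * 30.5      # assumption: four months is roughly 122 days
TRACE_USERS = 30_000       # approximate user population behind the trace

hits_per_day = SQUAT_HITS / TRACE_DAYS


def projected_daily_hits(users: int) -> float:
    """Scale the observed daily rate linearly with the user population."""
    return hits_per_day * users / TRACE_USERS


for users in (30_000, 300_000, 3_000_000):
    daily = projected_daily_hits(users)
    print(f"{users:>9,} users: ~{daily:,.0f} hits/day, ~{daily * 365:,.0f} hits/year")
```

The linear scaling is the slide's own premise ("at a similar ratio"); a different user mix could of course change the rate.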
Typo-squatting Ratio
• 0.025% of total number of queries
• (89% , ≤ 1%) (70%, ≤ 0.1%) ( 57%, ≤ 0.01%)
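The overall figure follows from the hit counts reported earlier (13,431 squat hits against ~53,000,000 target-domain hits). The sketch below also shows how a pair such as (89%, ≤ 1%) could be computed, reading it as "89% of target domains have a ratio of at most 1%"; that reading is an assumption, and the per-domain counts are made-up illustration values.

```python
def squat_ratio(squat_hits: int, target_hits: int) -> float:
    """Typo-squatting hits as a fraction of the target domain's hits."""
    return squat_hits / target_hits


def share_at_or_below(ratios, threshold: float) -> float:
    """Fraction of domains whose ratio does not exceed the threshold."""
    ratios = list(ratios)
    return sum(r <= threshold for r in ratios) / len(ratios)


overall = squat_ratio(13_431, 53_000_000)
print(f"overall ratio: {overall:.3%}")

# Made-up per-domain (squat_hits, target_hits) pairs, for illustration only.
per_domain = {"a.example": (5, 1_000), "b.example": (1, 100_000), "c.example": (30, 1_000)}
ratios = [squat_ratio(s, t) for s, t in per_domain.values()]
print(f"share of domains with ratio <= 1%: {share_at_or_below(ratios, 0.01):.0%}")
```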
User Correction Ratio – Alexa-500
• 54% of typo-squatting queries are corrected
• ~51% of squatted target domains have most of their squat hits corrected
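One way such a correction ratio could be measured from the DNS trace is sketched below. The slides do not define "corrected", so this assumes a squat query counts as corrected when the same hashed IP queries the target domain within a fixed time window afterwards; the 60-second window and the sample queries are hypothetical.

```python
from typing import Iterable, Tuple

Query = Tuple[float, str, str]  # (timestamp in seconds, hashed_ip, domain)


def correction_ratio(queries: Iterable[Query], squat: str, target: str,
                     window: float = 60.0) -> float:
    """Share of squat queries followed by a target query from the same
    hashed IP within `window` seconds (assumed definition)."""
    ordered = sorted(queries)
    squat_hits = corrected = 0
    for i, (ts, ip, domain) in enumerate(ordered):
        if domain != squat:
            continue
        squat_hits += 1
        for later_ts, later_ip, later_domain in ordered[i + 1:]:
            if later_ts - ts > window:
                break  # sorted by time, so nothing later can qualify
            if later_ip == ip and later_domain == target:
                corrected += 1
                break
    return corrected / squat_hits if squat_hits else 0.0


sample = [
    (0.0, "u1", "www.wamart.com"), (10.0, "u1", "www.walmart.com"),   # corrected
    (100.0, "u2", "www.wamart.com"), (500.0, "u2", "www.walmart.com"),  # too late
]
print(correction_ratio(sample, "www.wamart.com", "www.walmart.com"))
```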
Potential Hit Loss
• Potential Hit Loss Ratio = 0.012%
• (92% , ≤ 1%) (78%, ≤ 0.1%) (64%, ≤ 0.01%)
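The slides do not spell out how the potential hit loss ratio is defined. One reading that is consistent with the reported numbers treats it as the uncorrected share of the typo-squatting ratio; the sketch below checks that arithmetic and should be taken as an assumption, not as the authors' definition.

```python
squat_ratio = 13_431 / 53_000_000   # typo-squatting hits / target-domain hits
corrected_share = 0.54              # share of squat queries that users correct

# Assumed definition: hits that go to a squatter and are never corrected.
potential_loss = squat_ratio * (1 - corrected_share)
print(f"potential hit loss ratio: {potential_loss:.3%}")
```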
Potential Money Loss
• ~75% do not point to target domains
• Referring typo-squatting ratio = 0.008%
• (96%, ≤ 1%) (91%, ≤ 0.1%) (81%, ≤ 0.01%)
Typo-Squatting Distribution
• 19% of all typo-squatting hits
Typo Characterization
• Most typos are single errors (95% vs. 5%)
• Most gTLD substitutions are “com” to “org” (50%)
• Additions: 37% involve non-adjacent keys
• Substitutions: 77% involve non-adjacent keys
• Substitutions: 13% are between “a” and “o”
• Spelling error
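Whether two keys count as adjacent depends on the keyboard model. The sketch below assumes a US QWERTY layout treated as a plain grid (row stagger ignored) and calls two letter keys adjacent when they sit within one row and one column of each other; the slides do not state the definition that was used.

```python
# Letter rows of a US QWERTY keyboard, treated as an unstaggered grid (assumption).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def key_position(ch: str):
    """Return (row, column) of a single letter key, or None if not a letter key."""
    for row, keys in enumerate(QWERTY_ROWS):
        col = keys.find(ch)
        if col >= 0:
            return row, col
    return None


def are_adjacent(a: str, b: str) -> bool:
    """True if the two keys are distinct and at most one row and one column apart."""
    pa, pb = key_position(a.lower()), key_position(b.lower())
    if pa is None or pb is None or pa == pb:
        return False
    return abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1


print(are_adjacent("a", "s"))  # neighbouring keys on the home row
print(are_adjacent("a", "o"))  # far apart, so a slip of the finger is unlikely
```

Under this model “a” and “o” are not adjacent, which fits the slide's note that such substitutions look like spelling errors rather than mistyped keys.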
Typo-squatting Domains – TP60
• 15,499 hits
• 0.045% of total number of queries
• (76%, ≤ 1%) (60%, ≤ 0.5%)
Future Work
• How much of the ads budget goes to squatters?
• Enhance our identification technique
• See if the results hold at other ISPs
• Typo modeling for getting traffic back
Related Work
• MS Strider project [Wang et al., SRUTI 06]
• McAfee study [Keats, McAfee White Paper 07]
• JAAL project [Banerjee et al., Infocom 08]
Conclusion
Accurately and automatically identify typo-squatting domains
How much traffic goes to typo-squatters
Bound on how much traffic the target domain is losing to typo-squatting
• Inconsequential