URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression Presented by : Mohammed Nazim Feroz 11/26/2013
Mar 15, 2016
Motivation
Web services drive new opportunities for people to interact, but they also create new opportunities for criminals
Google detects about 300,000 malicious websites per month, a clear indication that criminals are exploiting these opportunities
Almost all online threats have one thing in common: they require the user to click on a hyperlink or type in a website address
Motivation
The user needs to perform sanity checks and assess the risk of visiting a URL
Performing such an evaluation might be impossible for a novice user
As a result, users often click links without paying close attention to the URLs, which leaves them vulnerable to exploitation by malicious websites
Introduction
Openness of the web exposes opportunities for criminals to upload malicious content
Do techniques exist to prevent malicious content from entering the web?
Current Techniques
Security practitioners have developed techniques such as blacklisting in order to protect users from malicious websites
Although this approach has minimal overhead, it does not provide complete protection, as only about 55% of malicious URLs are present in blacklists
Another drawback is that a malicious website is not on the blacklist during the period before it is detected
Current Techniques
Security researchers have done extensive work on detecting social network accounts that are used to spread malicious messages
This approach still does not fully protect users in real-time settings such as social networks, because building a profile of malicious activity can take a considerable amount of time
Current Techniques
The TokDoc researchers use a method that decides, token by token, whether a token requires automatic healing
Their work uses n-grams and length as features for detecting malicious URLs
This research builds on their idea by supplementing a subset of their features with host-based features, which have proven to carry a wealth of useful information
Approach
URLDoc classifies URLs automatically based on the lexical (textual) and host-based features
Scalable machine learning algorithms from Mahout are used to develop and test the classifier
Online learning is considered over batch learning
The classifier achieves 93-97% accuracy by detecting a large number of malicious hosts, with a modest false positive rate
Approach
If these predictor variables are correctly identified and the URL metadata is carefully derived, the machine learning algorithms can sift through tens of thousands of features
Online algorithms are preferred over batch-learning algorithms
Batch learning algorithms look at every example in the training dataset on every step and then update the weights of the classifier – a costly operation if the number of training examples is large
Approach
Online algorithms update the weights according to the gradient of the error with respect to a single training example
Online algorithms are able to process datasets far more efficiently than batch algorithms
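The single-example update described above can be sketched as follows. This is a minimal illustration for logistic regression, not the Mahout implementation used in the research; the function names, the learning rate of 0.1, and the toy data are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function, mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, x, y, learning_rate=0.1):
    """Update weights in place from ONE example (x: list of floats, y: 0 or 1)."""
    prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    error = prediction - y  # derivative of the log-loss with respect to the score
    for j, xj in enumerate(x):
        weights[j] -= learning_rate * error * xj
    return weights

# Toy stream (assumed data): first value is a bias term, second a made-up score.
stream = [([1.0, 0.9], 1), ([1.0, 0.1], 0), ([1.0, 0.8], 1), ([1.0, 0.2], 0)]
weights = [0.0, 0.0]
for _ in range(200):
    for x, y in stream:
        sgd_step(weights, x, y)  # cost per update depends only on len(x)
```

Each call touches one example, so the cost of an update does not grow with the size of the training set, whereas a batch step would loop over every example before changing the weights.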
Problem Formulation
URL classification lends itself naturally to binary classification
The target variable y(i) can take one of two possible values: malicious or benign
With k predictor variables across all feature categories, URL i is described by x1(i), …, xk(i), giving a k-dimensional feature vector that characterizes the URL
The goal is to learn a function h(x)=y that maps the space of input values to the space of output values so that h(x) is a good predictor for the corresponding value of y
Problem Formulation
Two main phases are involved in building a classification system
The first phase creates the model (i.e. the function h(x)) produced by the learning algorithm
The second phase makes use of that model to assign new data from the test dataset to its predicted target class
The selection of the training dataset and its predictor variables, the target classes, and the learning algorithm is vital in the first phase of building the classification system
Predicted labels are compared with known answers to evaluate the classifier
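The comparison of predicted labels with known answers can be sketched as below. The function name and the sample labels are hypothetical; the two metrics are the standard definitions of accuracy and false positive rate, with "malicious" treated as the positive class.

```python
def evaluate(predicted, actual, positive="malicious"):
    """Return (accuracy, false_positive_rate) for two equal-length label lists."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    # False positive: a benign URL that the classifier flagged as malicious.
    false_positives = sum(p == positive and a != positive for p, a in pairs)
    negatives = sum(a != positive for a in actual)
    accuracy = correct / len(actual)
    false_positive_rate = false_positives / negatives if negatives else 0.0
    return accuracy, false_positive_rate

# Made-up labels for illustration only.
predicted = ["malicious", "benign", "malicious", "benign", "benign"]
actual = ["malicious", "benign", "benign", "benign", "malicious"]
accuracy, false_positive_rate = evaluate(predicted, actual)
```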
Overview of Features
Lexical features
These features take both binary and continuous values
These features include:
Length of the URL
Number of dots in the URL
Tokens present in the hostname, primary domain, and path parts of a URL
Features in the hostname are further characterized as bigrams
Bigrams capture patterns in hostnames where randomly permuted character strings occur in certain combinations
Example: www.depts.ttu.edu
Bigrams: depts ttu, ttu edu
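A rough sketch of extracting these lexical features is given below. The function name, the path delimiters, and the choice to drop a leading "www" before forming bigrams are assumptions; the last one is made only so that the output matches the example above.

```python
import re
from urllib.parse import urlsplit

def lexical_features(url):
    """Extract simple lexical features from a URL string."""
    # urlsplit only recognizes the host if a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host_tokens = [t for t in (parts.hostname or "").split(".") if t]
    path_tokens = [t for t in re.split(r"[/\-_.?=&]+", parts.path) if t]
    # Assumption: ignore a leading "www" when building hostname bigrams.
    core = host_tokens[1:] if host_tokens[:1] == ["www"] else host_tokens
    bigrams = [f"{a} {b}" for a, b in zip(core, core[1:])]
    return {
        "length": len(url),
        "dots": url.count("."),
        "host_tokens": host_tokens,
        "path_tokens": path_tokens,
        "host_bigrams": bigrams,
    }

features = lexical_features("www.depts.ttu.edu")
```

Length and dot count are continuous values, while each token or bigram would typically become a binary "present or absent" feature.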
Overview of Features
Host-based features
IP address of the URL (A record)
IP address of the mail exchanger (MX record)
IP address of the name server (NS record)
PTR record
AS number
IP prefix
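One way such host-based properties could be fed to a classifier is as categorical "name=value" tokens. The sketch below assumes the lookups have already been done elsewhere and arrive as a dictionary; the function name, the key names, and the "missing" placeholder are hypothetical, and the sample values come from ranges reserved for documentation.

```python
HOST_FIELDS = ("a", "mx", "ns", "ptr", "asn", "prefix")

def host_feature_tokens(records):
    """Turn already-resolved host records into 'name=value' categorical tokens."""
    tokens = []
    for name in HOST_FIELDS:
        value = records.get(name)
        # Assumption of this sketch: an absent record becomes its own category.
        tokens.append(f"{name}={value if value is not None else 'missing'}")
    return tokens

# Illustrative values only (documentation IP range and documentation AS number).
tokens = host_feature_tokens({
    "a": "192.0.2.10",
    "asn": 64496,
    "prefix": "192.0.2.0/24",
})
```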
Overview of Features
Malicious websites have exhibited a pattern of being hosted in a particular “bad” portion of the Internet
Example: McColo provided hosting for major botnets, which in turn were responsible for sending 41% of the world’s spam just before McColo’s takedown in November 2008. McColo’s AS number was 26780
These portions of the internet can be characterized on a regular basis by retraining on the predictor variables
This allows keeping track of concept drift
Online Logistic Regression with SGD
Logistic regression is a very flexible algorithm, as it allows the predictor variables to be both continuous and binary
Mahout greatly helps in the learning process by choosing an optimum learning rate and thus allowing the classification system to converge to the global minimum
Online Logistic Regression with SGD
Compared with batch learning, online learning is usually much faster, adapts to changes continuously, and scales much better when the training and test datasets are large
Support Vector Machines were considered but not chosen, since they take longer to train than Online Logistic Regression
Online Logistic Regression converges more quickly if malicious and benign URLs from the training dataset are presented in a random order
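Presenting the examples in random order can be as simple as the sketch below. The function name, the 1/0 label encoding, and the fixed seed (used only to make the run repeatable) are assumptions.

```python
import random

def shuffled_examples(malicious, benign, seed=0):
    """Label both URL lists and return them mixed in random order."""
    labeled = [(url, 1) for url in malicious] + [(url, 0) for url in benign]
    random.Random(seed).shuffle(labeled)  # local generator, global state untouched
    return labeled

# Placeholder URLs for illustration only.
examples = shuffled_examples(
    ["bad1.example", "bad2.example"],
    ["good1.example", "good2.example", "good3.example"],
)
```

Without the shuffle, an online learner would see one class in a long unbroken run before the other, and each run would pull the weights toward that class.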
Feature Vector
Feature hashing is used in order to encode the raw feature data into feature vectors
In this approach, a reasonable size (i.e. dimension) is picked for the feature vector and the data is put into feature vectors of the chosen size
After carefully considering the datasets, a feature vector size of about 100,000 dimensions was chosen for this research
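The idea of feature hashing can be sketched in a few lines. This is not Mahout's encoder: the function name, the use of CRC32 as the hash, and the sparse dictionary representation are assumptions chosen to keep the example short and repeatable across runs.

```python
import zlib

def hash_features(tokens, dimension=100_000):
    """Map string tokens to a sparse {index: count} vector of fixed dimension."""
    vector = {}
    for token in tokens:
        # CRC32 is stable across runs, unlike Python's salted built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % dimension
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

vector = hash_features(["depts ttu", "ttu edu", "asn=64496", "ttu edu"])
```

The vector size is fixed up front, so a token never seen during training still lands in a valid position; the price is that two different tokens can collide on the same index.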
Feature Vector Example
The data is encoded into the feature vector as continuous, categorical, word-like, and text-like features using the Mahout API
Results
[Figures: results for the 90/10 and 80/20 training/test dataset splits]
Results
[Figure: results for the 50/50 training/test dataset split; axis label: Benign:Malicious]
Other Approaches Attempted
Term Frequency-Inverse Document Frequency (TF-IDF)
A bag-of-words approach was used, and a term (lexical feature) by document (URL) matrix was created
Online Logistic Regression is not affected by good word weighting
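For reference, one common TF-IDF variant (raw term count multiplied by log(N / document frequency)) can be sketched as below. The slides do not say which variant was used, so this weighting, the function name, and the sample tokens are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Weight each term of each tokenized document (here, a URL) by TF-IDF."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted

# Made-up tokenized URLs for illustration.
urls = [["login", "paypal", "com"], ["ttu", "edu"], ["paypal", "com"]]
weights_per_url = tf_idf(urls)
```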
Clustering
The URLs are viewed as a set of vectors in vector space
Cosine similarity was used as the similarity measure between URLs
This research focused on classification over clustering since the target classes of the URLs were known
Clustering is known to be useful when the target classes are unknown
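Cosine similarity between two URL vectors can be computed as in this sketch. The sparse {index: value} representation and the function name are assumptions; returning 0.0 for an all-zero vector is a convention picked here to avoid division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {index: value}."""
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_u * norm_v)

similarity = cosine_similarity({0: 1.0, 1: 1.0}, {0: 1.0})
```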
Future Work
Study the various features extensively and use only those with the highest contributions
Add new features that would help in better classification
Try to use algorithms that can benefit from parallelization
Summary
A reliable framework for the classification of URLs is built
A supervised learning method is used to learn the characteristics of both malicious and benign URLs and classify them in real time
The applicability and usefulness of Mahout for the URL classification task are demonstrated, and the benefits of an online setting over a batch setting are illustrated: the online setting enabled learning new trends in the characteristics of URLs over time
Questions?