CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Ashwani Rao
Dept of Computer & Information Sciences
University of Delaware
Learning to Detect and Classify Malicious Executables in the Wild
J. Zico Kolter, Marcus A. Maloof
Introduction
• Machine learning and data mining to identify malicious code
• What is malicious code?
• Why not antivirus suites?
• Training set: 1971 benign and 1651 malicious executables
• Features extracted: byte-code n-grams; malicious executables were also labeled by the function of their payload
• Learning algorithms: naive Bayes, SVMs, decision trees, and boosting
Goals of the Research Paper
• Show how to use established methods to detect and classify malicious executables
• Present empirical results from an extensive study of inductive methods for detection and classification
• Show that the methods achieve high detection rates on new and unseen executables
Related Work
• Lo et al., 1995; Kephart et al., 1995; Tesauro et al., 1996; Schultz et al., 2001
• Lo et al., 1995: analysis of several programs
• Schultz et al., 2001: used data mining to detect malicious executables, with three kinds of features:
  • Binary profiling (RIPPER rule learner)
  • String sequences (naive Bayes)
  • Hex dumps (six naive Bayes classifiers)
Data Collection and Classification Methods
• 1971 benign and 1651 malicious executables in Windows PE format
• N-grams: combine each four-byte sequence into a single term. For example, for ff 00 ab 3e 12 b3 the corresponding n-grams are ff00ab3e, 00ab3e12, and ab3e12b3.
• Each n-gram is treated as a Boolean attribute (present in or absent from an executable)
• The most relevant attributes (n-grams) are selected using information gain, also called average mutual information. The 500 most relevant n-grams were kept.
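The n-gram extraction and information-gain ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`byte_ngrams`, `information_gain`, `top_ngrams`) are hypothetical, and information gain is computed as class entropy minus the expected class entropy after splitting on the presence of one n-gram.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> list[str]:
    """Return every overlapping n-byte window of `data` as a hex string."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


def entropy(counts) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present: list[bool], labels: list[str]) -> float:
    """Class entropy minus expected entropy after splitting on one Boolean attribute."""
    base = entropy(Counter(labels).values())
    remainder = 0.0
    for value in (True, False):
        subset = [lab for p, lab in zip(present, labels) if p == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(Counter(subset).values())
    return base - remainder


def top_ngrams(files: list[bytes], labels: list[str], k: int = 500, n: int = 4) -> list[str]:
    """Rank every distinct n-gram by information gain and keep the best k."""
    grams_per_file = [set(byte_ngrams(f, n)) for f in files]
    vocabulary = set().union(*grams_per_file)
    scores = {
        g: information_gain([g in grams for grams in grams_per_file], labels)
        for g in vocabulary
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


print(byte_ngrams(bytes.fromhex("ff00ab3e12b3")))
# ['ff00ab3e', '00ab3e12', 'ab3e12b3']
```

Holding every distinct n-gram in memory, as this sketch does, would not scale to the collection sizes reported in the study; it is meant only to make the two steps concrete.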
Classification methods
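• Instance-based learner: stores the collection of training examples and classifies a new instance by comparing it with the stored examples
• Naive Bayes: probabilistic model based on the prior probability of each class, P(Ci), and the conditional probability of each attribute value given the class, P(vj | Ci)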
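As an illustration of how P(Ci) and P(vj | Ci) combine into a prediction, here is a minimal naive Bayes sketch over Boolean attributes. The function names are hypothetical, and the add-one (Laplace) smoothing is an assumption made here to avoid zero probabilities; the slides do not say how the probabilities were estimated.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(C_i) and P(v_j = True | C_i) from Boolean feature vectors."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    true_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            true_counts[label][j] += int(bool(value))
    priors = {c: count / len(labels) for c, count in class_counts.items()}
    # Add-one smoothing (an assumption of this sketch) keeps every estimate in (0, 1).
    likelihoods = {
        c: [(true_counts[c][j] + 1) / (class_counts[c] + 2) for j in range(n_features)]
        for c in class_counts
    }
    return priors, likelihoods


def nb_predict(model, row):
    """Return the class maximizing log P(C_i) + sum_j log P(v_j | C_i)."""
    priors, likelihoods = model

    def log_posterior(c):
        score = math.log(priors[c])
        for p_true, value in zip(likelihoods[c], row):
            score += math.log(p_true if value else 1.0 - p_true)
        return score

    return max(priors, key=log_posterior)


model = train_naive_bayes([[1, 0], [1, 1], [0, 1], [0, 0]], ["mal", "mal", "ben", "ben"])
print(nb_predict(model, [1, 0]))  # mal
```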
Classification methods
• Support vector machines: a vector of weights w and a threshold b define a separating hyperplane. A kernel function maps the training data into a higher-dimensional space so that the problem becomes linearly separable.
• Decision trees: internal nodes correspond to attributes and leaf nodes correspond to class labels.
• Boosted classifiers: boosting is a method for combining multiple classifiers. It produces a set of weighted models by iteratively learning a model from a weighted data set, evaluating it, and reweighting the data set based on the model's performance.
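The learn, evaluate, reweight loop of boosting can be sketched with the textbook binary AdaBoost update and one-attribute decision stumps as the base model. This is an assumed, simplified variant chosen for illustration; it is not necessarily the boosting implementation or base learners used in the study, and all names are hypothetical. Labels are +1 (malicious) and -1 (benign).

```python
import math


def stump_predict(j, polarity, row):
    """One-attribute decision stump: predict `polarity` if attribute j is set."""
    return polarity if row[j] else -polarity


def best_stump(rows, labels, weights):
    """Return (weighted error, attribute, polarity) of the lowest-error stump."""
    best = None
    for j in range(len(rows[0])):
        for polarity in (1, -1):
            error = sum(
                w for row, y, w in zip(rows, labels, weights)
                if stump_predict(j, polarity, row) != y
            )
            if best is None or error < best[0]:
                best = (error, j, polarity)
    return best


def adaboost(rows, labels, rounds=10):
    """Learn a list of (model weight, attribute, polarity) triples."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, j, polarity = best_stump(rows, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, j, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(j, polarity, row))
            for row, y, w in zip(rows, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(j, polarity, row) for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where the label is the majority of three Boolean attributes, no single stump is correct on every example, but three rounds of this sketch yield a weighted vote that classifies all eight examples correctly.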
Detecting malicious code using n-grams
• Used ten-fold cross-validation
• Pilot study: determined the size of the n-grams and the number of relevant n-grams. Used n-grams with n = 4 and selected the best number of n-grams using information gain; 500 relevant n-grams produced the best results.
• Experiment with the small collection: a small collection of executables with a total of 68,744,909 n-grams
• Experiment with the large collection: 255 million distinct n-grams of size 4
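For readers unfamiliar with ten-fold cross-validation, the following sketch shows the partitioning and the train/test loop. It assumes a plain random partition and reports accuracy for brevity; `train_fn` and `predict_fn` are hypothetical callables standing in for any of the classifiers above.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle the example indices and deal them into k nearly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held]
        model = train_fn([examples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, examples[i]) == labels[i] for i in held_out)
    return correct / len(examples)
```

Every example is used for testing exactly once and is never part of the training set of the model that scores it.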
Results for the Small Collection
• ROC curve for detecting malicious executables in the small collection
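As a reminder of how an ROC curve is obtained from a classifier's scores, the sketch below sweeps a decision threshold from high to low and records the false-positive and true-positive rates at each step. The function names are hypothetical, the area computation uses the trapezoid rule, and the code assumes at least one malicious and one benign example.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs; labels are True for malicious."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_malicious) in enumerate(ranked):
        if is_malicious:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of ROC points sorted by false-positive rate."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(points)                    # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(points))  # 0.75
```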
Results for the Larger Collection
• ROC curve for detecting malicious executables in the larger collection
Classifying Executables by Payload Function
• Goal: measure the extent to which classification methods can determine whether a given malicious executable opens a backdoor, mass-mails, or is an executable virus
• Identify and enumerate the functions of payloads
• Many executables fell into multiple categories
• Experimental design similar to the previous one, but for each function the data set was built from malicious executables only
• Used ten-fold cross-validation
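Because one executable can carry several payload functions, each function gets its own binary data set. A minimal sketch of that labeling step follows; the sample names and function tags are made up for illustration, and `payload_dataset` is a hypothetical helper.

```python
def payload_dataset(executables, target):
    """Label each malicious executable by whether its payload includes `target`."""
    return [(name, target in functions) for name, functions in executables.items()]


# Hypothetical malicious executables, each tagged with one or more payload functions.
malicious = {
    "sample_a.exe": {"backdoor"},
    "sample_b.exe": {"mass-mailer", "virus"},
    "sample_c.exe": {"backdoor", "mass-mailer"},
}

mass_mailer_data = payload_dataset(malicious, "mass-mailer")
print(mass_mailer_data)
# [('sample_a.exe', False), ('sample_b.exe', True), ('sample_c.exe', True)]
```

The same executable can therefore be a positive example in one data set and a negative example in another.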
Experimental Results
• ROC curve for mass mailing capabilities
Experimental Results
• ROC curve for backdoor capabilities
Evaluating Real-World Online Performance
• Applied the method to 291 real-world malicious executables discovered after the original data were gathered
• Classifiers were built from the original data, which contained both benign and malicious code
• Boosted decision trees detected 98% of the new malicious code
Conclusion and Future work
• Machine learning and data mining are useful and appropriate tools for detecting malware
• Boosted classifiers and support vector machines performed exceptionally well
• Boosting reduces bias and variance and outperformed the other classifiers in the study
• This approach is scalable
• 20-25% of the malicious code was obfuscated using compression and encryption
• Future work: for the payload-function experiments, remove the obfuscation and rerun the experiments with a larger data set
Conclusion and Future Work
• Study the similarity of malicious code and how such executables change over time; clustering can provide good insight into this
• Combining this approach with a search for known signatures and with executing and analyzing code in a virtual machine will provide better computer security
Q&A ?