Fourteenforty Research Institute, Inc.
FFRI, Inc. http://www.ffri.jp
Improving accuracy of malware detection by filtering evaluation dataset based on its similarity
Junichi Murakami, Director of Advanced Development
May 06, 2015
Preface
• These slides were used for a presentation at CSS2013
– http://www.iwsec.org/css/2013/english/index.html
• Please refer to the original paper for the detailed data
– http://www.ffri.jp/assets/files/research/research_papers/MWS2013_paper.pdf (written in Japanese, but the figures are the same)
• Contact information
– @FFRI_Research (Twitter)
Agenda
• Background
• Problem
• Scope and purpose
• Experiment 1
• Experiment 2
• Experiment 3
• Consideration
• Conclusion
Background – malware and its detection
• Increasing malware
– Targeted attacks (unknown malware)
– Malware generators
– Obfuscators
• Limitation of signature matching
• Other methods
– Heuristics
– Cloud reputation
– Machine learning
– Big data
Background – Related works
• Mainly focusing on a combination of the factors below
– Feature selection and modification, parameter settings
• Some good results are reported (TPR: 90% or higher, FPR: 1% or lower)
• Features
– Static information
– Dynamic information
– Hybrid
• Algorithms
– SVM
– Naive Bayes
– Perceptron, etc.
• Evaluation
– TPR/FPR, etc.
– ROC curve, etc.
– Accuracy, precision
Problem
• General theory of machine learning:
– Classification accuracy declines if the trends of the training and testing data differ
• How about malware and benign files?
Scope and purpose
① Investigate the difference in similarity trends between malware and benign files (Experiment-1)
② Investigate how that difference affects classification accuracy (Experiment-2)
③ Based on the results above, confirm the effect of removing data whose similarity to the training data is low (Experiment-3)
Experiment-1(1/3)
• Used FFRI Dataset 2013 and benign files we collected as datasets
• Calculated the pairwise similarities among the malware files and among the benign files (Jubatus, MinHash)
• Feature vector: the number of occurrences of each 4-gram of sequential API calls
– ex: NtCreateFile_NtWriteFile_NtWriteFile_NtClose: n times
– ex: NtSetInformationFile_NtClose_NtClose_NtOpenMutext: m times
• Similarity matrix (one for malware, one for benign):
       A     B     C     ...
A      -     0.8   0.52  ...
B      -     -     1.0   ...
C      -     -     -     ...
...    -     -     -     -
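As a rough illustration of the feature vector and similarity computation described above, here is a minimal Python sketch. It is not Jubatus code: it uses only the standard library, the function names (`api_4grams`, `minhash_signature`, `estimate_similarity`) and the signature length of 64 are assumptions made for this example, and the MinHash shown is the plain unweighted variant (it ignores how many times each 4-gram occurs), which may differ from what Jubatus computes.

```python
import hashlib
from collections import Counter

def api_4grams(calls):
    """Count each 4-gram of sequential API calls, joined with '_'."""
    grams = ("_".join(calls[i:i + 4]) for i in range(len(calls) - 3))
    return Counter(grams)

def _hash(seed, item):
    # Seeded 64-bit hash derived from SHA-1 (an illustrative choice).
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=64):
    """For each seed, keep the minimum hash over the (non-empty) item set."""
    items = set(items)
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hypothetical API call trace, for illustration only.
trace = ["NtCreateFile", "NtWriteFile", "NtWriteFile", "NtClose", "NtClose"]
features = api_4grams(trace)
sig = minhash_signature(features.keys())
```

Comparing two signatures with `estimate_similarity` gives a value between 0.0 and 1.0, which is the kind of number that fills the similarity matrix.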
Experiment-1(2/3)
• Grouping malware and benign files based on their similarities
[Figure: malware and benign files grouped according to the threshold of similarity (0.0 - 1.0)]
Experiment-1(3/3)
• It is more difficult to find similar benign files than similar malware
[Figure: stacked bars (0% to 100%) showing the share of unique and not unique files for benign and for malware at similarity thresholds 0.8, 0.85, 0.9, 0.95 and 1]
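The unique / not unique split can be sketched as follows. This assumes one plausible reading of the slides: a file is "not unique" at a given threshold if at least one other file in the same group has a similarity at or above that threshold. The function name `split_unique` and the dictionary layout are hypothetical; the three similarity values are the sample values from the matrix in Experiment-1(1/3).

```python
def split_unique(similarity, names, threshold):
    """Return (unique, not_unique) lists of names for one threshold.

    similarity maps frozenset({x, y}) of two names to their similarity.
    """
    unique, not_unique = [], []
    for name in names:
        has_peer = any(
            similarity.get(frozenset((name, other)), 0.0) >= threshold
            for other in names
            if other != name
        )
        (not_unique if has_peer else unique).append(name)
    return unique, not_unique

names = ["A", "B", "C"]
similarity = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.52,
    frozenset(("B", "C")): 1.0,
}
for threshold in (0.8, 0.9, 1.0):
    unique, not_unique = split_unique(similarity, names, threshold)
    print(threshold, "unique:", unique, "not unique:", not_unique)
```

With these sample values, raising the threshold from 0.8 to 0.9 turns A into a unique file, while B and C stay not unique because they match each other exactly.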
Experiment-2(1/3)
• How much does the difference affect the result?
• 50% of the malware and benign files are assigned to a training dataset, and the rest to a testing dataset (Jubatus, AROW)
• TPR: True Positive Rate, FPR: False Positive Rate
[Figure: the training halves of benign and malware are used to train Jubatus; the testing halves are classified by Jubatus to measure TPR and FPR]
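For reference, the 50/50 split and the two metrics can be sketched in a few lines of Python. This is a minimal illustration and not the Jubatus/AROW pipeline itself: `split_half` and `tpr_fpr` are hypothetical helper names, the fixed seed is an arbitrary choice, and the labels and predictions are made-up values.

```python
import random

def split_half(samples, seed=0):
    """Shuffle and split samples into (training, testing) halves."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def tpr_fpr(labels, predictions):
    """Compute (TPR, FPR); both inputs hold 'malware' or 'benign'."""
    pairs = list(zip(labels, predictions))
    tp = sum(l == "malware" and p == "malware" for l, p in pairs)
    fn = sum(l == "malware" and p == "benign" for l, p in pairs)
    fp = sum(l == "benign" and p == "malware" for l, p in pairs)
    tn = sum(l == "benign" and p == "benign" for l, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # detected malware / all malware
    fpr = fp / (fp + tn) if fp + tn else 0.0  # flagged benign / all benign
    return tpr, fpr

# Made-up labels and predictions, for illustration only.
labels = ["malware"] * 4 + ["benign"] * 4
predictions = ["malware", "malware", "malware", "benign",
               "benign", "benign", "benign", "malware"]
tpr, fpr = tpr_fpr(labels, predictions)
print(f"TPR: {tpr:.2%}  FPR: {fpr:.2%}")
```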
Experiment-2(2/3)
[Figure: train/classify flow with Jubatus for the benign and malware training and testing splits, measuring TPR and FPR]
Experiment-2(3/3)
• Accuracy declines if the trends of the training and testing data differ

           not unique   unique    difference
TPR (%)    97.996       81.297    -16.699
FPR (%)    0.624        4.49      +3.866
Experiment-3(1/6) – After training
[Figure: malware and benign regions separated by the dividing line; legend: benign (train), malware (train), benign (test), malware (test), dividing line]
Experiment-3(2/6) – After classification
[Figure: testing data plotted against the dividing line; testing data on the wrong side of the line are marked FP and FN; legend: benign (train), malware (train), benign (test), malware (test), dividing line]
Experiment-3(3/6) – Low-similarity data
[Figure: testing data with low similarity to the training data, marked FN, FN and TP (accidentally); legend: benign (train), malware (train), benign (test), malware (test), dividing line]
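The filtering idea behind Experiment-3 can be sketched as below: a testing sample is classified only if its highest similarity to any training sample reaches the threshold. This is a minimal sketch under stated assumptions. It swaps in exact Jaccard similarity over sets of 4-grams instead of the MinHash estimate, and `jaccard`, `filter_by_similarity`, and the sample sets are hypothetical.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two sets (stand-in for MinHash)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_by_similarity(test_items, train_items, threshold, similarity=jaccard):
    """Split test items into (kept, filtered) by max similarity to training data."""
    kept, filtered = [], []
    for item in test_items:
        best = max((similarity(item, t) for t in train_items), default=0.0)
        (kept if best >= threshold else filtered).append(item)
    return kept, filtered

# Made-up 4-gram sets, for illustration only.
train = [{"g1", "g2", "g3"}, {"g4", "g5"}]
test = [{"g1", "g2", "g3"}, {"g1", "g2", "g9"}, {"g7", "g8"}]
kept, filtered = filter_by_similarity(test, train, threshold=0.8)
print("classified:", len(kept), "not classified:", len(filtered))
```

A threshold of 0 keeps every testing sample, which corresponds to the unfiltered baseline at the left edge of the threshold axis.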
Experiment-3(4/6) – Effect on TPR
[Figure: the number of classified data (TP and FN, axis 0 to 1400) and TPR (axis 0.88 to 1.00) plotted against the threshold of similarity (0, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)]
Experiment-3(5/6) – Effect on FPR
[Figure: the number of classified data (TN and FP, axis 0 to 2500) and FPR (axis 0.000 to 0.014) plotted against the threshold of similarity (0, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)]
Experiment-3(6/6)
• Transition of the number of classified data
[Figure: the number of classified data / the number of total testing data (0% to 120%) for malware and benign, plotted against the threshold of similarity (0, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)]
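A threshold sweep like the one plotted in Experiment-3(4/6) to (6/6) can be sketched as follows. It assumes each testing sample already carries its true label, the classifier's prediction, and its highest similarity to the training data, and that TPR and FPR are computed over the classified (kept) samples only. The function name `sweep`, the record layout, and the records themselves are made up for illustration and are not data from the experiments.

```python
def sweep(records, thresholds):
    """records: (label, prediction, max_similarity_to_training) tuples,
    where label and prediction are 'malware' or 'benign'."""
    total_mal = sum(label == "malware" for label, _, _ in records)
    total_ben = len(records) - total_mal
    rows = []
    for th in thresholds:
        kept = [r for r in records if r[2] >= th]
        tp = sum(l == "malware" and p == "malware" for l, p, _ in kept)
        fn = sum(l == "malware" and p == "benign" for l, p, _ in kept)
        fp = sum(l == "benign" and p == "malware" for l, p, _ in kept)
        tn = sum(l == "benign" and p == "benign" for l, p, _ in kept)
        rows.append({
            "threshold": th,
            "tpr": tp / (tp + fn) if tp + fn else None,
            "fpr": fp / (fp + tn) if fp + tn else None,
            # Share of each class that is still classified after filtering.
            "malware_ratio": (tp + fn) / total_mal if total_mal else None,
            "benign_ratio": (fp + tn) / total_ben if total_ben else None,
        })
    return rows

# Made-up records, for illustration only.
records = [
    ("malware", "malware", 1.0),
    ("malware", "malware", 0.9),
    ("malware", "benign", 0.3),
    ("benign", "benign", 0.95),
    ("benign", "malware", 0.2),
    ("benign", "benign", 0.1),
]
rows = sweep(records, [0.0, 0.6, 0.9])
for row in rows:
    print(row)
```

In this toy data the misclassified samples all have low similarity to the training data, so raising the threshold improves TPR and FPR while shrinking the share of data that gets classified, which is the trade-off the three figures show.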
Consideration(1/3)
• In a real scenario:
– We try to classify whether an unknown file/process is benign or not
• If we apply the approach of Experiment-3:
– Files are classified only if similar data has already been trained
– If not, files are not classified, which results in
• FN if the file is malware
• TN if the file is benign (the right outcome as a result)
• Therefore, the problem is the “TPR for unique malware” (unique malware is likely to go undetected)
Consideration(2/3)
• If malware continues to have many variants, as it does currently
– ML-based detection works well
• Having many variants ∝ use of malware generators/obfuscators
• We have to investigate
– Trends in the usage of the tools above
– The possibility of techniques that evade machine-learning-based detection
Consideration(3/3)
• How to deal with unclassified (filtered) data
1. Use other feature vectors
2. Enlarge the training dataset (Unique → Not unique)
3. Use methods other than ML
Conclusion
• The distributions of similarity for malware and benign files are different (Experiment-1)
• Accuracy declines if the trends of the training and testing data differ (Experiment-2)
• TPR for unique malware declines when we remove low-similarity data (Experiment-3)
• Continual investigation of trends in malware and related tools is required
• (It might be necessary to develop technology to identify benign files)