U.S. Army Research, Development and Engineering Command
Jaime C. Acosta, Ph.D.
Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors
Mar 27, 2015
Research Question
Can we reduce redundant analysis by finding common behaviors in malware instances?
Malware Analysis
Dynamic Analysis: Run the malware instance (binary) in a controlled environment
– Log all events (registry, memory, sockets, etc.)
– Analyze logs for malicious behavior
– Find similar malware instances based on runtime behavior
Malware Analysis
Event Logs
Malware A (event code: event)
00: Initialize network socket
01: Establish connection to malicious.com
02: Load library
03: Sleep
…
Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
– Find common n-grams (or sequences of events) in event logs
Event codes
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 03, 02, …
Malware C: 04, 02, 05, 01, 02, …
Common 2-grams
Malware A / Malware B: 01, 02; 02, 03;
Malware A / Malware C: 01, 02;
Malware B / Malware C: 01, 02;
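The 2-gram comparison on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the Rieck et al. implementation; the helper names `ngrams` and `common_ngrams` are hypothetical, and the traces are the toy event codes from the slide with the trailing "…" left out.

```python
def ngrams(trace, n):
    """Return the set of contiguous length-n event sequences in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def common_ngrams(trace_x, trace_y, n=2):
    """Return the n-grams that occur in both traces."""
    return ngrams(trace_x, n) & ngrams(trace_y, n)


# Toy traces copied from the slide (elided events are not guessed).
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "03", "02"],
    "C": ["04", "02", "05", "01", "02"],
}

ab = common_ngrams(traces["A"], traces["B"])
ac = common_ngrams(traces["A"], traces["C"])
bc = common_ngrams(traces["B"], traces["C"])
print(sorted(ab), sorted(ac), sorted(bc))
```

Because n is fixed, a longer shared run shows up only as several overlapping n-grams, which is the loss of context discussed on the following slides.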
Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
– Find common fixed size n-grams (or sequences of events) in event logs
– Common 2-grams: Malware A / Malware B share 01, 02 and 02, 03; Malware A / Malware C share 01, 02; Malware B / Malware C share 01, 02
Malware A / Malware B are more likely to be of the same “type”
Malware Instance Similarity
Limitations for post analysis
– Lose context given by varied-length sequences
Malware A (event code: event)
00: Initialize network socket
01: Establish connection to malicious.com
02: Load library
03: Sleep
04: …
05: Install a rootkit
…
Malware Instance Similarity
Limitations for post analysis
– Lose context given by varied-length sequences
– Lose commonalities between different “types” of malware
Malware A: <“Install rootkit”>, 08, …
Malware B: 06, 04, <“Install rootkit”>, 00, 01, …
Malware C: 00, <“Install rootkit”>, …
Approach
Common Substrings Algorithm
– Based on the Longest Common Substring
– Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset
Approach
Malheur Reference Dataset
– Dynamic traces of 3131 malware instances
• Generated with CWSandbox
• Trace size ranges from 700B to 3.4MB
• Collected in August 2009
Approach
Malheur Reference Dataset
– Traces split into 2 sets

                                             Small Set (<100KB)   Large Set (>=100KB)
Total # malware instance trace files         2,071                1,060
Total # events                               1,217,985            17,400,262
Total size of malware instance trace files   44 MB                490 MB
Approach
Goal
– Reduce redundant analysis, especially in larger malware
• First, find common substrings within small malware traces
• Next, reduce analysis workload by removing redundancies in larger malware traces
Approach – Common Substrings Algorithm
Input: Malware dynamic traces of the small set (size < 100KB)
Events
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 06, 02, …
Malware C: 04, 02, 03, 00, …
Malware D: 04, 05, 06, 02, …
Malware E: 02, 03, 00, 04, …
Malware F: 04, 05, 06, 00, …
Output: Common substrings matrix
     A  B  C  D  E  F
A    X  X  X  X  X  X
B    …  X  X  X  X  X
C    …  …  X  X  X  X
D    …  …  …  X  X  X
E    …  …  …  …  X  X
F    …  …  …  …  …  X
All common substrings between pairs of malware traces
Approach – Common Substrings Algorithm
Iteration 0
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 06, 02, …
Table (rows: Malware B events, columns: Malware A events)
      00     01     02     03
01    “”     “”     “”     “”
02    “”     “”     “”     “”
06    “”     “”     “”     “”
02    “”     “”     “”     “”
Approach – Common Substrings Algorithm
Iteration 1
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 06, 02, …
Table (rows: Malware B events, columns: Malware A events)
      00     01     02     03
01    “”     “01”   “”     “”
02    “”     “”     “”     “”
06    “”     “”     “”     “”
02    “”     “”     “”     “”
Approach – Common Substrings Algorithm
Iteration 2 – match found, merge with upper left corner
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 06, 02, …
Table (rows: Malware B events, columns: Malware A events)
      00     01     02        03
01    “”     “01”   “”        “”
02    “”     “”     “01,02”   “”
06    “”     “”     “”        “”
02    “”     “”     “”        “”
Approach – Common Substrings Algorithm
Final Iteration
Malware A: 00, 01, 02, 03, …
Malware B: 01, 02, 06, 02, …
Table (rows: Malware B events, columns: Malware A events)
      00     01     02        03
01    “”     “01”   “”        “”
02    “”     “”     “01,02”   “”
06    “”     “”     “”        “”
02    “”     “”     “02”      “”
We have 2 common substrings.
We only keep those with minimum substring length 2
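The table walk-through above can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the presenter's code: the function name `common_substrings` is hypothetical, the table stores run lengths instead of the strings shown on the slides, and only runs that cannot be extended further are reported.

```python
def common_substrings(a, b, min_len=2):
    """Return maximal common contiguous event runs of length >= min_len."""
    rows, cols = len(b), len(a)
    # length[i][j] = length of the common run ending at b[i-1] and a[j-1].
    length = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if b[i - 1] == a[j - 1]:
                # Match found: extend the run stored in the upper-left cell.
                length[i][j] = length[i - 1][j - 1] + 1

    found = set()
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            run = length[i][j]
            # A run is maximal when the next pair of events does not match.
            extends = i < rows and j < cols and b[i] == a[j]
            if run >= min_len and not extends:
                found.add(tuple(a[j - run:j]))
    return found


malware_a = ["00", "01", "02", "03"]
malware_b = ["01", "02", "06", "02"]

all_runs = common_substrings(malware_a, malware_b, min_len=1)
kept = common_substrings(malware_a, malware_b, min_len=2)
print(all_runs, kept)
```

On the slide's toy traces, a minimum length of 1 yields the two runs from the final table, and a minimum length of 2 keeps only the longer one.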
Approach – Common Substrings Algorithm
Selecting which Common Substrings to keep
Common Substrings Matrix for Malware A / Malware B (non-empty cells after the final iteration): “01”, “01,02”, “02”
We have 2 common substrings: “01,02” and “02” (“01” is contained in “01,02”).
We only keep those with minimum substring length 2
Dataset matrix (A to F): the kept substring 01,02 is stored in the Malware A / Malware B cell
Approach – Common Substrings Algorithm
Unique common substrings are merged
[Figure: dataset matrix for traces A to F, with each pair's cell filled with that pair's common substrings, e.g. 01,02 and 02,03,04]
Small set (<100KB) common substrings
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;
…
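A rough sketch of the merge step, assuming the six toy traces from the Input slide (trailing "…" omitted). The helper names `maximal_common_runs` and `merge_unique` are hypothetical, and the pairwise comparison is a simple brute-force scan chosen for brevity; it is not claimed to be how the original tool was built.

```python
from itertools import combinations


def maximal_common_runs(x, y, min_len):
    """Brute-force search for maximal common contiguous runs in two traces."""
    runs = set()
    for i in range(len(x)):
        for j in range(len(y)):
            # Start only where the run cannot be extended to the left.
            if x[i] != y[j] or (i and j and x[i - 1] == y[j - 1]):
                continue
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            if k >= min_len:
                runs.add(tuple(x[i:i + k]))
    return runs


def merge_unique(traces, min_len=2):
    """Fill the upper triangle of the pair matrix and merge unique runs."""
    matrix, unique = {}, set()
    for (name_x, x), (name_y, y) in combinations(traces.items(), 2):
        matrix[(name_x, name_y)] = maximal_common_runs(x, y, min_len)
        unique |= matrix[(name_x, name_y)]  # duplicates collapse in the set
    return matrix, unique


traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "06", "02"],
    "C": ["04", "02", "03", "00"],
    "D": ["04", "05", "06", "02"],
    "E": ["02", "03", "00", "04"],
    "F": ["04", "05", "06", "00"],
}
matrix, unique = merge_unique(traces)
print(sorted(unique))
```

Storing the merged result as a set means a substring shared by many pairs is kept once, which is why many shared substrings help keep the total size small.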
Approach – Common Substrings Algorithm
Doesn’t that take a lot of space?
– Many shared common substrings
– Total size of all unique common substrings was 25MB
Doesn’t that take a lot of processing time?
– Can be run on separate processes with multithreading
– GPU
Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;
…
Large set traces, before and after removal
Malware AA: 00, 02, 02, 03, … → <removed>, <removed>, 02, 03, … (40% shared)
Malware BB: 00, 01, 02, 02, … → 00, <removed>, <removed>, <removed>, … (30% shared)
Malware CC: 00, 01, 03, 02, … → 00, 01, <removed>, <removed>, … (50% shared)
Approach
Find and remove common substrings in large set (size >= 100KB)
Shared portion per trace: Malware AA 40%, Malware BB 30%, Malware CC 50%
Average = 40%
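The removal step and the shared percentage can be sketched as follows. This is an illustrative reading of the slide, with hypothetical helper names `mark_shared` and `remove_shared`; the inputs are the four-event excerpts and the five listed substrings from the slide, so the computed fractions describe only these excerpts and need not match the percentages shown, which presumably refer to the full traces.

```python
def mark_shared(trace, substrings):
    """Flag every event position covered by at least one known substring."""
    shared = [False] * len(trace)
    for sub in substrings:
        n = len(sub)
        for start in range(len(trace) - n + 1):
            if tuple(trace[start:start + n]) == tuple(sub):
                shared[start:start + n] = [True] * n
    return shared


def remove_shared(trace, substrings):
    """Return the reduced trace and the fraction of events that were shared."""
    shared = mark_shared(trace, substrings)
    reduced = ["<removed>" if hit else event for event, hit in zip(trace, shared)]
    fraction = sum(shared) / len(trace) if trace else 0.0
    return reduced, fraction


# Common substrings listed on the slide (the elided "…" entries are omitted).
known = [
    ("03", "02", "20", "40", "35"),
    ("03", "02", "02", "02", "03"),
    ("01", "02", "02"),
    ("00", "02"),
    ("03", "02"),
]
large_set = {
    "AA": ["00", "02", "02", "03"],
    "BB": ["00", "01", "02", "02"],
    "CC": ["00", "01", "03", "02"],
}

results = {name: remove_shared(trace, known) for name, trace in large_set.items()}
average = sum(fraction for _, fraction in results.values()) / len(results)
for name, (reduced, fraction) in results.items():
    print(name, reduced, f"{fraction:.0%} shared")
```

Rerunning the same loop with `known` filtered to substrings of at least a given length would give one average per minimum length.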
Approach
Find and remove common substrings in large set (size >= 100KB)
This process was run several times with minimum substring lengths from 2 to 100
Results
Analyst’s dream: Many long common substrings are shared with the larger set
Results
[Figure with three labeled regions: A, B, C]
• A - Not too interesting; finding common pairs of instructions is expected and will not reduce redundant analysis by much
• B - Indicates that small traces can be analyzed, thus reducing the larger set analysis by about half
• C - Some reassurance that the dataset was reasonably diverse
Contributions
– The common substring algorithm is capable of identifying similarities in dynamic traces of malware
– Redundant event sequences can be identified to reduce analysis
– Commonalities are not limited to short event sequences
Future Work
– Use behavior templates
• For example: regular expressions to identify a recurring sequence (5 vs. 10 sleep events)
– Develop a user interface
– Optimization
• GPU
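One possible reading of the behavior-template idea is to collapse runs of a repeated event with a regular expression, so that 5 and 10 consecutive sleep events map to the same template. The sketch below is speculative: the function name `collapse_repeats`, the `code+` notation, and the use of code 03 for "Sleep" (taken from the earlier toy event log) are all assumptions.

```python
import re

# A whole event code followed by one or more immediate repeats of itself.
# The lookarounds keep the match aligned to complete, space-separated codes.
REPEAT = re.compile(r"(?<!\S)(\S+)(?: \1)+(?!\S)")


def collapse_repeats(trace):
    """Replace each run of identical consecutive events with 'code+'."""
    text = " ".join(trace)
    return REPEAT.sub(r"\1+", text).split()


# Two hypothetical traces that differ only in how often they sleep.
five_sleeps = ["02"] + ["03"] * 5 + ["01"]
ten_sleeps = ["02"] + ["03"] * 10 + ["01"]

template_5 = collapse_repeats(five_sleeps)
template_10 = collapse_repeats(ten_sleeps)
print(template_5, template_10)
```

After collapsing, both traces share one template, so a common substring search over templates would treat them as the same behavior.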
Questions
Sample Common Substrings
Retrieve file from server and replace system file
– Load library
– Connect
– Download
– Check if exists
– Remove
– Copy
– Remove evidence
Dataset Reference
• http://pi1.informatik.uni-mannheim.de/malheur/