Post on 27-Mar-2015


U.S. Army Research, Development and Engineering Command

Jaime C. Acosta, Ph.D.

Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors

Research Question

Can we reduce redundant analysis by finding common behaviors in malware instances?

Malware Analysis

Dynamic Analysis: Run the malware instance (binary) in a controlled environment

– Log all events (registry, memory, sockets, etc.)
– Analyze logs for malicious behavior
– Find similar malware instances based on runtime behavior

Malware Analysis

Event Logs

Malware A event log (event codes): 00, 01, 02, 03

Events: Initialize network socket; Establish connection to malicious.com; Load library; Sleep

Malware Instance Similarity

Event n-grams (Rieck et al. 2010)
– Find common fixed-size n-grams (or sequences of events) in event logs

Events (codes):
Malware A: 00, 01, 02, 03
Malware B: 01, 02, 03, 02
Malware C: 04, 02, 05, 01, 02

2-grams for Malware A / Malware B: 01, 02; 02, 03;
2-grams for Malware A / Malware C: 01, 02;
2-grams for Malware B / Malware C: 01, 02;

Malware A / Malware B are more likely to be of the same “type”
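The 2-gram comparison on this slide can be reproduced with a few lines of Python. This is only a sketch of the fixed-size n-gram idea, not code from the talk or from Rieck et al.; the names `ngrams` and `common_ngrams` are made up for illustration.

```python
def ngrams(trace, n):
    """Return the set of all length-n event sequences in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


# Event logs copied from the slide.
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "03", "02"],
    "C": ["04", "02", "05", "01", "02"],
}


def common_ngrams(x, y, n=2):
    """n-grams that appear in both named traces."""
    return ngrams(traces[x], n) & ngrams(traces[y], n)


for pair in (("A", "B"), ("A", "C"), ("B", "C")):
    print(pair, sorted(common_ngrams(*pair)))
```

A and B share two 2-grams while each of them shares only one with C, which is the basis for the "same type" remark.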

Malware Instance Similarity

Limitations for post analysis
– Lose context given by varied-length sequences
– Lose commonalities between different “types” of malware

Varied-length sequences:
Malware A event log (event codes): 00, 01, 02, 03, 04, 05
Events: Initialize network socket; Establish connection to malicious.com; Load library; Sleep; Install a rootkit

Commonalities between different “types”:
Malware A: <“Install rootkit”>, 08
Malware B: 06, 04, <“Install rootkit”>, 00, 01
Malware C: 00, <“Install rootkit”>

Approach

Common Substrings Algorithm
– Based on the Longest Common Substring
– Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset

Approach

Malheur Reference Dataset
– Dynamic traces of 3131 malware instances
• Generated with CWSandbox
• Trace size ranges from 700B to 3.4MB
• Collected in August 2009

Approach

Malheur Reference Dataset
– Traces split into 2 sets

                                              Small Set (<100KB)    Large Set (>=100KB)
Total # malware instance trace files          2,071                 1,060
Total # events                                1,217,985             17,400,262
Total size of malware instance trace files    44 MB                 490 MB
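As a rough illustration of the split, the sketch below partitions traces by size. The trace names and sizes are invented, and reading "100KB" as 100 * 1024 bytes is an assumption; the slides do not say which convention was used.

```python
SMALL_LIMIT = 100 * 1024  # assumption: "100KB" read as 100 * 1024 bytes


def split_by_size(sizes, limit=SMALL_LIMIT):
    """Split {trace name: size in bytes} into (small, large) name lists."""
    small = sorted(name for name, size in sizes.items() if size < limit)
    large = sorted(name for name, size in sizes.items() if size >= limit)
    return small, large


# Hypothetical trace sizes, for illustration only.
example_sizes = {
    "trace_a": 700,
    "trace_b": 99_000,
    "trace_c": 102_400,
    "trace_d": 3_400_000,
}

small_set, large_set = split_by_size(example_sizes)
print(small_set, large_set)
```

A trace of exactly the limit lands in the large set, matching the ">=100KB" column heading.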

Approach

Goal
– Reduce redundant analysis, especially in larger malware
• First, find common substrings within small malware traces
• Next, reduce analysis workload by removing redundancies in larger malware traces

Approach – Common Substrings Algorithm

Input: Malware dynamic traces of the small set (size < 100KB)

Events:
Malware A: 00, 01, 02, 03 …
Malware B: 01, 02, 06, 02 …
Malware C: 04, 02, 03, 00 …
Malware D: 04, 05, 06, 02 …
Malware E: 02, 03, 00, 04 …
Malware F: 04, 05, 06, 00 …

Output: Common substrings matrix

      A    B    C    D    E    F
A     X    X    X    X    X    X
B     …    X    X    X    X    X
C     …    …    X    X    X    X
D     …    …    …    X    X    X
E     …    …    …    …    X    X
F     …    …    …    …    …    X

All common substrings between pairs of malware traces
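One way to fill such a pairwise matrix is sketched below. It is not the talk's implementation: the helper `common_substrings` stores run lengths instead of the strings shown on the slides, keeps only runs that cannot be extended further, and uses just the four events shown per trace (the slide's "…" suggests the real traces continue).

```python
from itertools import combinations


def common_substrings(a, b, min_len=2):
    """Maximal common contiguous event runs of length >= min_len."""
    rows, cols = len(b), len(a)
    # length[i][j] = length of the common run ending at b[i-1] and a[j-1]
    length = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if b[i - 1] == a[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
    found = set()
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            k = length[i][j]
            extends = i < rows and j < cols and b[i] == a[j]
            if k >= min_len and not extends:
                found.add(tuple(b[i - k:i]))
    return found


# First four events of each trace, copied from the slide.
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "06", "02"],
    "C": ["04", "02", "03", "00"],
    "D": ["04", "05", "06", "02"],
    "E": ["02", "03", "00", "04"],
    "F": ["04", "05", "06", "00"],
}

# One cell per unordered pair; the diagonal and mirrored cells are skipped.
matrix = {
    (x, y): common_substrings(traces[x], traces[y])
    for x, y in combinations(sorted(traces), 2)
}
```

Six traces give 15 cells, one for each "…" position in the matrix above.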

Approach – Common Substrings Algorithm

Iteration 0

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

                  Malware A
                  00     01     02         03
Malware B   01    “”     “”     “”         “”
            02    “”     “”     “”         “”
            06    “”     “”     “”         “”
            02    “”     “”     “”         “”

Approach – Common Substrings Algorithm

Iteration 1

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

                  Malware A
                  00     01     02         03
Malware B   01    “”     “01”   “”         “”
            02    “”     “”     “”         “”
            06    “”     “”     “”         “”
            02    “”     “”     “”         “”

Approach – Common Substrings Algorithm

Iteration 2 – match found, merge with upper left corner

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

                  Malware A
                  00     01     02         03
Malware B   01    “”     “01”   “”         “”
            02    “”     “”     “01,02”    “”
            06    “”     “”     “”         “”
            02    “”     “”     “”         “”

Approach – Common Substrings Algorithm

Final Iteration

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

                  Malware A
                  00     01     02         03
Malware B   01    “”     “01”   “”         “”
            02    “”     “”     “01,02”    “”
            06    “”     “”     “”         “”
            02    “”     “”     “02”       “”

We have 2 common substrings (“01,02” and “02”).

We only keep those with minimum substring length 2 (here, “01,02”).
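The iterations above can be traced in code. The sketch below follows the slides by storing the matched run itself in each cell and merging with the upper-left cell on a match. Treating a run as "common" only when it is not extended by the next diagonal cell is my reading of why “01” is not counted separately; the function names are illustrative.

```python
def substring_table(a, b):
    """table[i][j] is the common run of events ending at b[i] and a[j],
    or an empty tuple when the two events differ."""
    table = [[() for _ in a] for _ in b]
    for i, event_b in enumerate(b):
        for j, event_a in enumerate(a):
            if event_b == event_a:
                # Match found: merge with the upper-left cell, if any.
                upper_left = table[i - 1][j - 1] if i > 0 and j > 0 else ()
                table[i][j] = upper_left + (event_a,)
    return table


def keep_common(table, min_len=2):
    """Keep runs of at least min_len that are not extended diagonally."""
    rows = len(table)
    cols = len(table[0]) if table else 0
    kept = set()
    for i in range(rows):
        for j in range(cols):
            run = table[i][j]
            extended = i + 1 < rows and j + 1 < cols and bool(table[i + 1][j + 1])
            if len(run) >= min_len and not extended:
                kept.add(run)
    return kept


malware_a = ["00", "01", "02", "03"]
malware_b = ["01", "02", "06", "02"]
table = substring_table(malware_a, malware_b)

for event, row in zip(malware_b, table):
    print(event, [",".join(cell) for cell in row])
print(keep_common(table, min_len=1))
print(keep_common(table, min_len=2))
```

Storing full runs in every cell mirrors the slides but costs memory on long traces; storing only run lengths and slicing the trace afterwards is a common alternative.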

Approach – Common Substrings Algorithm

Selecting which Common Substrings to keep

Common Substrings Matrix

                  Malware A
                  00     01     02         03
Malware B   01    “”     “01”   “”         “”
            02    “”     “”     “01,02”    “”
            06    “”     “”     “”         “”
            02    “”     “”     “02”       “”

We have 2 common substrings (“01,02” and “02”).

We only keep those with minimum substring length 2 (here, “01,02”), and store them in the A/B cell of the pairwise matrix:

      A        B    C    D    E    F
A     X        X    X    X    X    X
B     01,02    X    X    X    X    X
C                   X    X    X    X
D                        X    X    X
E                             X    X
F                                  X

Approach – Common Substrings Algorithm

Unique common substrings are merged

[Figure: pairwise common substrings matrix for Malware A–F; each cell below the diagonal lists the common substrings for that pair, and cells on and above the diagonal are marked X.]

Small set (<100KB) common substrings:
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;
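Merging amounts to a set union over the matrix cells. The sketch below uses a small hand-written matrix with illustrative cell values, not the values from the slide; `merge_unique` is a hypothetical helper name.

```python
# Illustrative matrix cells: pair of trace names -> set of common substrings.
matrix = {
    ("A", "B"): {("01", "02")},
    ("A", "C"): {("02", "03")},
    ("A", "E"): {("02", "03")},
    ("C", "E"): {("02", "03", "00")},
    ("D", "F"): {("04", "05", "06")},
}


def merge_unique(matrix):
    """Union all cells, so each distinct substring is stored once."""
    merged = set()
    for substrings in matrix.values():
        merged |= substrings
    # Longest first, then lexicographic, for a stable listing.
    return sorted(merged, key=lambda s: (-len(s), s))


for substring in merge_unique(matrix):
    print(",".join(substring) + ";")
```

Because the same substring often shows up in many pairs, the merged list can be much smaller than the sum of the cells.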

Approach – Common Substrings Algorithm

Doesn’t that take a lot of space?
– Many shared common substrings
– Total size of all unique common substrings was 25MB

Doesn’t that take a lot of processing time?
– Can be run on separate processes with multithreading
– GPU

Approach

Find and remove common substrings in large set (size >= 100KB)

Small set (<100KB) common substrings:
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;

Large set traces, before and after removal:
Malware AA: 00, 02, 02, 03 → <removed>, <removed>, 02, 03
Malware BB: 00, 01, 02, 02 → 00, <removed>, <removed>, <removed>
Malware CC: 00, 01, 03, 02 → 00, 01, <removed>, <removed>

40% shared
30% shared
50% shared

Average = 40%

This process was run several times with minimum length sizes 2 to 100
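The removal step can be sketched as follows. It marks every event covered by an occurrence of a small-set common substring and reports the covered fraction. The function names are illustrative, and the fractions for these four-event toy traces (0.5, 0.75, 0.5) differ from the percentages on the slide, which presumably refer to fuller traces.

```python
# Small-set common substrings and large-set traces copied from the slide.
COMMON = [
    ("03", "02", "20", "40", "35"),
    ("03", "02", "02", "02", "03"),
    ("01", "02", "02"),
    ("00", "02"),
    ("03", "02"),
]
LARGE = {
    "AA": ["00", "02", "02", "03"],
    "BB": ["00", "01", "02", "02"],
    "CC": ["00", "01", "03", "02"],
}


def remove_common(trace, substrings):
    """Return (trace with covered events removed, covered fraction)."""
    covered = [False] * len(trace)
    for sub in substrings:
        k = len(sub)
        for i in range(len(trace) - k + 1):
            if tuple(trace[i:i + k]) == tuple(sub):
                covered[i:i + k] = [True] * k
    reduced = ["<removed>" if c else e for c, e in zip(covered, trace)]
    shared = sum(covered) / len(trace) if trace else 0.0
    return reduced, shared


def average_shared(traces, substrings, min_len):
    """Average covered fraction using only substrings of >= min_len."""
    kept = [s for s in substrings if len(s) >= min_len]
    fractions = [remove_common(t, kept)[1] for t in traces.values()]
    return sum(fractions) / len(fractions)


for name, trace in LARGE.items():
    print(name, *remove_common(trace, COMMON))

# Sweep of minimum lengths, as in "minimum length sizes 2 to 100".
sweep = {n: average_shared(LARGE, COMMON, n) for n in range(2, 101)}
```

Raising the minimum length drops the short substrings first, so the shared fraction can only stay flat or fall as the minimum length grows.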

Results

Analyst’s dream: Many long common substrings are shared with the larger set

Results

[Figure with labels A, B, and C]

• A - Not too interesting: finding common pairs of instructions is expected and will not reduce redundant analysis by much

• B - Indicates that small traces can be analyzed, thus reducing the larger set analysis by about half

• C - Some reassurance that the dataset was reasonably diverse

Contributions

– The common substring algorithm is capable of identifying similarities in dynamic traces of malware

– Redundant event sequences can be identified to reduce analysis

– Commonalities are not limited to short event sequences

Future Work

– Use behavior templates
• For example: regular expressions to identify recurring sequences (5 vs. 10 sleep events)
– Develop a user interface
– Optimization
• GPU
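A behavior template of the kind mentioned under "Use behavior templates" could look like the sketch below. It is speculative: the talk does not describe a template format. It assumes two-digit event codes and, purely for illustration, treats code 03 as the sleep event. The regular expression collapses any run of a repeated event into a single token marked with "+".

```python
import re

# A two-digit event code followed by one or more repeats of the same code.
_RUN = re.compile(r"\b(\d\d)(?: \1)+\b")


def collapse_runs(trace):
    """Collapse repeated events so that run length no longer matters."""
    text = " ".join(trace)
    return _RUN.sub(r"\1+", text)


five_sleeps = ["01"] + ["03"] * 5 + ["02"]
ten_sleeps = ["01"] + ["03"] * 10 + ["02"]

print(collapse_runs(five_sleeps))
print(collapse_runs(ten_sleeps))
```

After collapsing, the trace with 5 sleep events and the one with 10 produce the same string, so they would match as the same behavior.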

Questions

Sample Common Substrings

Retrieve file from server and replace system file

– Load library

– Connect

– Download

– Check if exists

– Remove

– Copy

– Remove evidence

Dataset Reference

• http://pi1.informatik.uni-mannheim.de/malheur/
