U.S. Army Research, Development and Engineering Command Jaime C. Acosta, Ph.D. Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors

Mar 27, 2015

Kayla Armstrong
Page 1:

U.S. Army Research, Development and Engineering Command

Jaime C. Acosta, Ph.D.

Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors

Page 2:

Research Question

Can we reduce redundant analysis by finding common behaviors in malware instances?

Page 3:

Malware Analysis

Dynamic Analysis: Run the malware instance (binary) in a controlled environment

– Log all events (registry, memory, sockets, etc.)
– Analyze logs for malicious behavior
– Find similar malware instances based on runtime behavior

Page 4:

Malware Analysis

Event Logs (Malware A)

Event code   Event
00           Initialize network socket
01           Establish connection to malicious.com
02           Load library
03           Sleep

Page 5:

Malware Instance Similarity

Event n-grams (Rieck et al. 2010)
– Find common n-grams (or sequences of events) in event logs

Event codes:
Malware A: 00, 01, 02, 03
Malware B: 01, 02, 03, 02
Malware C: 04, 02, 05, 01, 02

Common 2-grams:
Malware A / Malware B: 01, 02; 02, 03;
Malware A / Malware C: 01, 02;
Malware B / Malware C: 01, 02;

Page 6:

Malware Instance Similarity

Event n-grams (Rieck et al. 2010)
– Find common fixed-size n-grams (or sequences of events) in event logs

Event codes (traces abbreviated with “…”):
Malware A: 00, 01, 02, 03
Malware B: 01, 02, 03, 02
Malware C: 04, 02, 05, 01, 02

Common 2-grams:
Malware A / Malware B: 01, 02; 02, 03;
Malware A / Malware C: 01, 02;
Malware B / Malware C: 01, 02;

Malware A / Malware B share the most 2-grams, so they are more likely to be of the same “type”
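
The fixed-size n-gram comparison on this slide can be sketched in a few lines of Python. This is only an illustration on the toy traces shown above, not the implementation from Rieck et al.; the function names ngrams and common_ngrams are hypothetical, and event codes are assumed to be stored as strings.

```python
def ngrams(trace, n):
    """Return the set of all length-n windows (n-grams) in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def common_ngrams(trace_a, trace_b, n=2):
    """Return the n-grams that occur in both traces, sorted for stable output."""
    return sorted(ngrams(trace_a, n) & ngrams(trace_b, n))


# Toy traces from the slide (event codes kept as strings).
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "03", "02"],
    "C": ["04", "02", "05", "01", "02"],
}

shared_ab = common_ngrams(traces["A"], traces["B"])  # two shared 2-grams
shared_ac = common_ngrams(traces["A"], traces["C"])  # one shared 2-gram
shared_bc = common_ngrams(traces["B"], traces["C"])  # one shared 2-gram
```

Because n is fixed, a longer shared behavior only ever shows up as a set of overlapping short windows, which is the limitation the next slides discuss.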

Page 7:

Malware Instance Similarity

Limitations for post analysis
– Lose context given by varied-length sequences

Malware A event codes: 00, 01, 02, 03, 04, 05
Event labels shown: Initialize network socket; Establish connection to malicious.com; Load library; Sleep; Install a rootkit

Page 8:

Malware Instance Similarity

Limitations for post analysis
– Lose context given by varied-length sequences
– Lose commonalities between different “types” of malware

Malware A: <“Install rootkit”>, 08
Malware B: 06, 04, <“Install rootkit”>, 00, 01
Malware C: 00, <“Install rootkit”>

Page 9:

Approach

Common Substrings Algorithm
– Based on the Longest Common Substring
– Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset

Page 10:

Approach

Malheur Reference Dataset
– Dynamic traces of 3131 malware instances
• Generated with CWSandbox
• Trace size ranges from 700B to 3.4MB
• Collected in August 2009

Page 11:

Approach

Malheur Reference Dataset
– Traces split into 2 sets

                                             Small Set (<100KB)   Large Set (>=100KB)
Total # malware instance trace files         2,071                1,060
Total # events                               1,217,985            17,400,262
Total size of malware instance trace files   44 MB                490 MB
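
The split into a small and a large set could be reproduced with a short script along these lines. This is a sketch under assumptions: the slides do not say whether "100KB" means 100,000 or 102,400 bytes (102,400 is assumed here), and the function name split_by_size is hypothetical.

```python
import os

# Assumption: "100KB" is read as 100 * 1024 bytes; the slides do not specify.
SIZE_THRESHOLD = 100 * 1024


def split_by_size(paths, threshold=SIZE_THRESHOLD):
    """Partition trace files into (small, large) lists by on-disk size.

    Files strictly below the threshold go to the small set; files at or
    above it go to the large set, matching "<100KB" and ">=100KB".
    """
    small, large = [], []
    for path in paths:
        if os.path.getsize(path) < threshold:
            small.append(path)
        else:
            large.append(path)
    return small, large
```

Calling split_by_size on the list of trace file paths returns the two lists in input order, so the same ordering can be reused for later per-file statistics.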

Page 12:

Approach

Goal
– Reduce redundant analysis, especially in larger malware
• First, find common substrings within small malware traces
• Next, reduce analysis workload by removing redundancies in larger malware traces

Page 13:

Approach – Common Substrings Algorithm

Input: Malware dynamic traces of the small set (size < 100KB)

Events (traces abbreviated with “…”):
Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02
Malware C: 04, 02, 03, 00
Malware D: 04, 05, 06, 02
Malware E: 02, 03, 00, 04
Malware F: 04, 05, 06, 00

Output: Common substrings matrix (all common substrings between pairs of malware traces)

     A   B   C   D   E   F
A    X   X   X   X   X   X
B    …   X   X   X   X   X
C    …   …   X   X   X   X
D    …   …   …   X   X   X
E    …   …   …   …   X   X
F    …   …   …   …   …   X

Page 14:

Approach – Common Substrings Algorithm

Iteration 0

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

Columns: Malware A events; rows: Malware B events

       00    01    02    03
01     “”    “”    “”    “”
02     “”    “”    “”    “”
06     “”    “”    “”    “”
02     “”    “”    “”    “”

Page 15:

Approach – Common Substrings Algorithm

Iteration 1

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

Columns: Malware A events; rows: Malware B events

       00    01      02    03
01     “”    “01”    “”    “”
02     “”    “”      “”    “”
06     “”    “”      “”    “”
02     “”    “”      “”    “”

Page 16:

Approach – Common Substrings Algorithm

Iteration 2 – match found, merge with upper left corner

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

Columns: Malware A events; rows: Malware B events

       00    01      02         03
01     “”    “01”    “”         “”
02     “”    “”      “01,02”    “”
06     “”    “”      “”         “”
02     “”    “”      “”         “”

Page 17:

Approach – Common Substrings Algorithm

Final Iteration

Malware A: 00, 01, 02, 03
Malware B: 01, 02, 06, 02

Columns: Malware A events; rows: Malware B events

       00    01      02         03
01     “”    “01”    “”         “”
02     “”    “”      “01,02”    “”
06     “”    “”      “”         “”
02     “”    “”      “02”       “”

We have 2 common substrings.
We only keep those with minimum substring length 2
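
The iterations above can be sketched as a small dynamic program. This is a minimal sketch of the idea rather than the author's implementation: it stores run lengths instead of the substring text in each cell, which is an assumed simplification, and the name common_substrings is hypothetical. A cell extends the cell to its upper left on a match, and only runs that cannot be extended further are reported.

```python
def common_substrings(trace_a, trace_b, min_len=2):
    """Return the maximal common event runs of length >= min_len.

    run[i][j] holds the length of the common run that ends at
    trace_b[i - 1] and trace_a[j - 1] (rows: trace B, columns: trace A).
    """
    rows, cols = len(trace_b), len(trace_a)
    run = [[0] * (cols + 1) for _ in range(rows + 1)]

    # Fill the matrix: on a match, extend the run from the upper-left cell.
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if trace_b[i - 1] == trace_a[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    # Collect runs that are not continued by the lower-right cell.
    found = set()
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            length = run[i][j]
            extends = i < rows and j < cols and run[i + 1][j + 1] > 0
            if length >= min_len and not extends:
                found.add(tuple(trace_b[i - length:i]))
    return found


trace_a = ["00", "01", "02", "03"]  # Malware A
trace_b = ["01", "02", "06", "02"]  # Malware B

all_runs = common_substrings(trace_a, trace_b, min_len=1)
kept = common_substrings(trace_a, trace_b, min_len=2)
```

On the toy traces, all_runs contains the two substrings from the final matrix ("01,02" and "02"), and kept contains only "01,02" once the minimum length of 2 is applied.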

Page 18:

Approach – Common Substrings Algorithm

Selecting which Common Substrings to keep

Common Substrings Matrix (columns: Malware A events; rows: Malware B events)

       00    01      02         03
01     “”    “01”    “”         “”
02     “”    “”      “01,02”    “”
06     “”    “”      “”         “”
02     “”    “”      “02”       “”

We have 2 common substrings.
We only keep those with minimum substring length 2

     A       B   C   D   E   F
A    X       X   X   X   X   X
B    01,02   X   X   X   X   X
C                X   X   X   X
D                    X   X   X
E                        X   X
F                            X

Page 19:

Approach – Common Substrings Algorithm

Unique common substrings are merged

[Figure: common substrings matrix for Malware A to F, with the cells below the diagonal filled with the common substrings found for each pair of traces]

Small set (<100KB) common substrings:
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;
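
Building the pairwise matrix and merging its unique entries could look like the sketch below. It runs on the six toy traces from the input slide, so its output differs from the illustrative list shown here. The helper names are hypothetical, and the per-pair routine is a compact variant that keeps only one matrix row at a time, an assumed optimization rather than something the slides describe.

```python
from itertools import combinations


def common_substrings(trace_a, trace_b, min_len=2):
    """Maximal common event runs of length >= min_len (rolling-row variant)."""
    found = set()
    prev = [0] * (len(trace_a) + 1)
    for i, event_b in enumerate(trace_b):
        curr = [0] * (len(trace_a) + 1)
        for j, event_a in enumerate(trace_a):
            if event_a != event_b:
                continue
            curr[j + 1] = prev[j] + 1  # extend the run from the upper left
            length = curr[j + 1]
            # The run continues if the next events of both traces also match.
            extends = (i + 1 < len(trace_b) and j + 1 < len(trace_a)
                       and trace_b[i + 1] == trace_a[j + 1])
            if length >= min_len and not extends:
                found.add(tuple(trace_b[i - length + 1:i + 1]))
        prev = curr
    return found


def merged_common_substrings(traces, min_len=2):
    """Return (per-pair matrix, set of unique substrings over all pairs)."""
    pair_matrix = {}
    merged = set()
    for (name_a, trace_a), (name_b, trace_b) in combinations(sorted(traces.items()), 2):
        subs = common_substrings(trace_a, trace_b, min_len)
        pair_matrix[(name_a, name_b)] = subs
        merged |= subs  # duplicates across pairs collapse in the set
    return pair_matrix, merged


# Toy traces from the input slide (Malware A to F).
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "06", "02"],
    "C": ["04", "02", "03", "00"],
    "D": ["04", "05", "06", "02"],
    "E": ["02", "03", "00", "04"],
    "F": ["04", "05", "06", "00"],
}

pair_matrix, merged = merged_common_substrings(traces)
```

Only one triangle of the matrix is computed because the comparison is symmetric, and storing the merged result as a set is what keeps shared substrings from being counted more than once.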

Page 20:

Approach – Common Substrings Algorithm

Doesn’t that take a lot of space?
– Many shared common substrings
– Total size of all unique common substrings was 25MB

Doesn’t that take a lot of processing time?
– Can be run on separate processes with multithreading
– GPU

Page 21:

Approach

Find and remove common substrings in large set (size >= 100KB)

Small set (<100KB) common substrings:
03,02,20,40,35;
03,02,02,02,03;
01,02,02;
00,02;
03,02;

Large set traces before removal:
Malware AA: 00, 02, 02, 03
Malware BB: 00, 01, 02, 02
Malware CC: 00, 01, 03, 02

Large set traces after removal:
Malware AA: <removed>, <removed>, 02, 03 (40% shared)
Malware BB: 00, <removed>, <removed>, <removed> (30% shared)
Malware CC: 00, 01, <removed>, <removed> (50% shared)
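
The removal step can be sketched as follows. The slides do not say how overlapping matches are handled, so this sketch assumes every event covered by any occurrence of a known substring is marked as removed; the name remove_known is hypothetical.

```python
def remove_known(trace, known_substrings):
    """Mark events covered by any known substring and report the shared fraction.

    Returns (reduced_trace, shared), where covered events are replaced by
    "<removed>" and shared is the fraction of events that were covered.
    Assumption: overlapping occurrences are all marked.
    """
    covered = [False] * len(trace)
    for sub in known_substrings:
        n = len(sub)
        for start in range(len(trace) - n + 1):
            if tuple(trace[start:start + n]) == tuple(sub):
                for k in range(start, start + n):
                    covered[k] = True
    reduced = ["<removed>" if hit else event for event, hit in zip(trace, covered)]
    shared = sum(covered) / len(trace) if trace else 0.0
    return reduced, shared


# Common substrings listed for the small set on the slide.
known = [
    ("03", "02", "20", "40", "35"),
    ("03", "02", "02", "02", "03"),
    ("01", "02", "02"),
    ("00", "02"),
    ("03", "02"),
]

reduced_aa, shared_aa = remove_known(["00", "02", "02", "03"], known)
reduced_bb, shared_bb = remove_known(["00", "01", "02", "02"], known)
reduced_cc, shared_cc = remove_known(["00", "01", "03", "02"], known)
```

The removed positions match the slide for all three traces. The percentages on the slide appear to be illustrative: on these four-event toy traces the sketch reports 50%, 75%, and 50% shared.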

Page 22:

Approach

Find and remove common substrings in large set (size >= 100KB)

Malware AA: 40% shared
Malware BB: 30% shared
Malware CC: 50% shared

Average = 40%

Page 23:

Approach

Find and remove common substrings in large set (size >= 100KB)

Malware AA: 40% shared
Malware BB: 30% shared
Malware CC: 50% shared

This process was run several times, with minimum substring lengths from 2 to 100

Page 24:

Results

Analyst’s dream: Many long common substrings are shared with the larger set

Page 25:

Results

[Figure: results with regions labeled A, B, and C]

• A - Not too interesting: finding common pairs of instructions is expected and will not reduce redundant analysis by much

Page 26:

Results

[Figure: results with regions labeled A, B, and C]

• B - Indicates that analyzing the small traces first can reduce the larger-set analysis by about half

Page 27:

Results

[Figure: results with regions labeled A, B, and C]

• C - Some reassurance that the dataset was reasonably diverse

Page 28:

Contributions

– The common substring algorithm is capable of identifying similarities in dynamic traces of malware

– Redundant event sequences can be identified to reduce analysis

– Commonalities are not limited to short event sequences

Page 29:

Future Work

– Use behavior templates
• For example: regular expressions to identify recurring sequences (5 vs. 10 sleep events)
– Develop a user interface
– Optimization
• GPU
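
One way the behavior-template idea might look is sketched below: a regular expression collapses any run of a repeated event code into a single "code+" token, so that traces differing only in the number of repeats compare as equal. This is a guess at the direction, not something the slides specify; the name collapse_repeats is hypothetical, and 03 is used as a stand-in sleep code.

```python
import re


def collapse_repeats(trace, sep=","):
    """Collapse consecutive repeats of one event code into 'code+'."""
    text = sep.join(trace)
    # (\w+) captures one code; the group matches one or more immediate
    # repeats of that same code. \b keeps "03" from matching inside "031".
    pattern = re.compile(r"\b(\w+)(?:" + re.escape(sep) + r"\1\b)+")
    return pattern.sub(r"\1+", text).split(sep)


five_sleeps = ["02"] + ["03"] * 5 + ["04"]
ten_sleeps = ["02"] + ["03"] * 10 + ["04"]

template_five = collapse_repeats(five_sleeps)
template_ten = collapse_repeats(ten_sleeps)
```

Both traces reduce to the same template (02, 03+, 04), so a common-substring search run on the collapsed traces would treat 5 and 10 sleep events as the same behavior.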

Page 30:

Questions

Page 31:

Sample Common Substrings

Retrieve file from server and replace system file

– Load library

– Connect

– Download

– Check if exists

– Remove

– Copy

– Remove evidence

Page 32:

Dataset Reference

• http://pi1.informatik.uni-mannheim.de/malheur/