Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi.

Dec 24, 2015

Page 1:

Fast Dynamic Binary Translation for the Kernel

Piyus Kedia and Sorav Bansal, IIT Delhi

Page 2:

Applications of Dynamic Binary Translation (DBT)

• OS Virtualization
• Testing and Verification of Compiled Programs
• Profiling and Debugging
• Software Fault Isolation
• Dynamic Optimizations
• Program Shepherding
• … and more

Page 3:

A Short Introduction to Dynamic Binary Translation (DBT)

[Diagram: control loop Start -> Dispatcher -> Translate Block -> Execute Block -> Dispatcher. The input is native code; each translated block terminates with a branch-to-dispatcher instruction.]

Page 4:

Code Cache

[Diagram: Start -> Dispatcher -> "cached?". If no: Translate Block (from native code) -> Store in code cache. If yes: Execute from Code Cache.]
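The dispatch loop with a code cache can be sketched as follows. This is a toy model under stated assumptions, not the authors' implementation: the guest program `GUEST`, the helper `translate_block`, and the addresses are all hypothetical, and a "translation" is just a Python callable that returns the next native PC in place of a real branch to the dispatcher.

```python
from typing import Callable, Dict, Optional

# Hypothetical guest: native PC -> (change to accumulator, next native PC or None to stop).
GUEST = {
    0x100: (1, 0x200),
    0x200: (10, 0x300),
    0x300: (100, None),
}

Translation = Callable[[dict], Optional[int]]


def translate_block(native_pc: int) -> Translation:
    """Translate one guest block; the result ends by handing the next PC back to the dispatcher."""
    delta, next_pc = GUEST[native_pc]

    def translated(state: dict) -> Optional[int]:
        state["acc"] += delta
        return next_pc  # stands in for the "branch to dispatcher" instruction

    return translated


def run(start_pc: int, code_cache: Dict[int, Translation], stats: dict) -> dict:
    state = {"acc": 0}
    pc: Optional[int] = start_pc
    while pc is not None:              # dispatcher loop
        block = code_cache.get(pc)     # cached?
        if block is None:              # no: translate, then store in code cache
            block = translate_block(pc)
            code_cache[pc] = block
            stats["translations"] += 1
        pc = block(state)              # yes: execute from code cache
    return state


cache: Dict[int, Translation] = {}
stats = {"translations": 0}
first = run(0x100, cache, stats)
second = run(0x100, cache, stats)      # reuses the three cached translations
```

The second run performs no new translations, which is the point of the cache: translation cost is paid once per block.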

Page 5:

DBT Overheads

• User-level DBT is well understood
  • Near-native performance for application-level workloads
• DBT for the kernel requires more mechanisms
  • Efficiently handling exceptions and interrupts
  • Case studies:
    • VMware’s Software Virtualization
    • DynamoRIO-Kernel (DRK) [ASPLOS ’12]

Page 6:

Interposition on Starting (Entry) Points

[Diagram: the code-cache dispatch loop (Dispatcher, "cached?", Translate Block from native code, Store in code cache, Execute from Code Cache), drawn with two Start (entry) points.]

Page 7:

IDT now points to the dispatcher

[Diagram: the code-cache dispatch loop (Dispatcher, "cached?", Translate Block from native code, Store in code cache, Execute from Code Cache), with the Interrupt Descriptor Table pointing to the Dispatcher.]

Page 8:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)

Page 9:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)

[Diagram: two views of the guest stack, with SP pointing at the interrupt frame. Before conversion the frame holds CS register, PC, and Flags; after conversion the PC slot holds the Native PC.]
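The conversion of the saved PC can be sketched as a reverse lookup from code-cache addresses to native addresses. This is a minimal sketch under assumptions: the table `CACHE_TO_NATIVE`, the `InterruptFrame` type, and every address are hypothetical, and a real translator would keep this mapping in its own data structures.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class InterruptFrame:
    pc: int     # saved program counter
    cs: int     # saved CS register
    flags: int  # saved flags


# Hypothetical map: code-cache address of each translated instruction -> native address.
CACHE_TO_NATIVE = [
    (0x9000, 0x1000),
    (0x9003, 0x1002),
    (0x900A, 0x1005),
]
CACHE_STARTS = [cache_addr for cache_addr, _ in CACHE_TO_NATIVE]


def to_native_pc(cache_pc: int) -> int:
    """Map a code-cache PC to the native PC of the instruction that contains it."""
    i = bisect_right(CACHE_STARTS, cache_pc) - 1
    if i < 0:
        raise ValueError("PC is below the start of the code cache")
    return CACHE_TO_NATIVE[i][1]


def convert_frame(frame: InterruptFrame) -> InterruptFrame:
    """Return a copy of the frame whose PC slot holds the native value."""
    return InterruptFrame(pc=to_native_pc(frame.pc), cs=frame.cs, flags=frame.flags)


converted = convert_frame(InterruptFrame(pc=0x9003, cs=0x10, flags=0x202))
```

This lookup runs on every interrupt and exception when the dispatcher interposes on all entry points, which is one source of the overhead discussed in the talk.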

Page 10:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions

Page 11:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions

Precise Exceptions: Before the execution of an exception handler, all instructions up to the executing instruction should have executed, and everything afterwards must not have executed.

Page 12:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations

[Diagram: Precise Exceptions. Instruction sequence push, store, add, sub, load, mov, pop; a prefix of the sequence is marked Executed, after which the exception handler executes.]
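Rolling back a partially executed translation can be illustrated with a small simulation. The sketch below assumes each native instruction is translated into one or more host operations and uses a whole-state checkpoint to undo partial effects; checkpointing is a simplification chosen for illustration, and `execute_precisely`, `bump`, and `fault` are hypothetical names.

```python
import copy


class Fault(Exception):
    """Models a hardware exception raised in the middle of a translation."""


def execute_precisely(state, translated):
    """Run (native_name, host_ops) pairs; on a fault, undo the partial instruction.

    Returns the name of the faulting native instruction, or None if none faulted.
    """
    for name, host_ops in translated:
        checkpoint = copy.deepcopy(state)   # state at the native instruction boundary
        try:
            for op in host_ops:
                op(state)
        except Fault:
            state.clear()
            state.update(checkpoint)        # roll back the partially executed translation
            return name
    return None


def bump(reg):
    def op(state):
        state[reg] += 1
    return op


def fault(state):
    raise Fault()


state = {"a": 0, "b": 0, "tmp": 0}
translated = [
    ("add", [bump("a")]),
    ("store", [bump("tmp"), fault]),  # faults after a partial side effect
    ("sub", [bump("b")]),
]
faulting = execute_precisely(state, translated)
```

After the call, the effect of `add` is visible, the partial effect of `store` on `tmp` has been undone, and `sub` has not run, which matches the definition of a precise exception.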

Page 13:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations
3. Emulates Precise Interrupts

Page 14:

What does the dispatcher do?

Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations
3. Emulates Precise Interrupts
  • Delays interrupt delivery until the start of the next native instruction

[Diagram: Precise Interrupts. Instruction sequence push, store, add, sub, load, mov, pop; a prefix of the sequence is marked Executed, after which the interrupt handler executes.]
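Delaying delivery to a native instruction boundary can be sketched as a small calculation. The model is an assumption made for illustration: each native instruction expands to a known number of host operations, and an interrupt that arrives in the middle of an expansion is held until the expansion finishes. The function name `delivery_point` is hypothetical.

```python
from itertools import accumulate


def delivery_point(host_ops_per_native, arrival):
    """Return (host_op_index, native_index) at which a pending interrupt is delivered.

    host_ops_per_native[i] is the number of host operations that native
    instruction i was translated into; arrival is the host-operation index at
    which the interrupt becomes pending.
    """
    boundaries = [0] + list(accumulate(host_ops_per_native))
    for native_index, boundary in enumerate(boundaries):
        if boundary >= arrival:
            return boundary, native_index
    raise ValueError("interrupt arrived after the end of the translation")


# Three native instructions translated into 1, 3, and 2 host operations.
mid_instruction = delivery_point([1, 3, 2], arrival=2)
on_boundary = delivery_point([1, 3, 2], arrival=1)
```

Here an interrupt arriving at host operation 2 falls inside the second native instruction, so it is delivered at host operation 4, just before the third native instruction starts.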

Page 15:

Effect on Performance

Applications with high interrupt and exception activity exhibit large DBT overheads

Page 16:

VMware’s Software Virtualization Overheads

[Bar chart: percentage overhead over native (y-axis, 0 to 140). Bars, group label "benchmarks": SpecInt 2.9, kernel-compile 27.11, apache 123.48.]

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.

Page 17:

VMware’s Software Virtualization Overheads

[Bar chart: percentage overhead over native (y-axis, 0 to 700). Bars: SpecInt 2.9, kernel-compile 27.11, apache 123.48, 2D-graphics 57.81, large-RAM 91.68, forkwait 603.44. Group labels on the x-axis: "benchmarks", "-m benchmarks".]

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.

Page 18:

VMware’s Software Virtualization Overheads

[Bar chart: percentage overhead over native (y-axis, 0 to 900). Bars: SpecInt 2.9, kernel-compile 27.11, apache 123.48, 2D-graphics 57.81, large-RAM 91.68, forkwait 603.44, divzero 262.54, syscall 853.72. Group labels on the x-axis: "benchmarks", "-m benchmarks", "nano-benchmarks".]

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.

Page 19:

DynamoRIO Kernel (DRK) Overheads

[Bar chart: percentage overhead over native (y-axis, 0 to 400). Bars: fileserver 212.3, webserver 351.85, webproxy 325.37, varmail 44.44, apachebench 184.13.]

Data from “Comprehensive Kernel Instrumentation via Dynamic Binary Translation”, P. Feiner, A.D. Brown, A. Goel, U. Toronto, ASPLOS 2012.

Page 20:

DRK vs BTKernel

Percentage overhead over native (bar chart, y-axis 0 to 400):

Benchmark      DRK       BTKernel
fileserver     212.3     0.36
webserver      351.85    2.19
webproxy       325.37    2.44
varmail        44.44     10.6
apachebench    184.13    0.42

Page 21:

Fully Transparent Execution is not required

• The OS kernel rarely relies on precise exceptions

• The OS kernel rarely relies on precise interrupts

• The OS kernel seldom inspects the PC address pushed on stack. It is only used at the time of returning from interrupt using the iret instruction.

Page 22:

Faster Execution is Possible

• Leave code cache addresses in kernel stacks.
  • An interrupt/exception directly jumps into the code cache, bypassing the dispatcher.

• Allow imprecise interrupts and exceptions.

• Handle special cases specially.

Page 23:

IDT now points to the code cache

[Diagram: the code-cache dispatch loop (Dispatcher, "cached?", Translate Block from native code, Store in code cache, Execute from Code Cache), with the Interrupt Descriptor Table pointing into the code cache.]
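One way to picture an IDT that points into the code cache is a shadow table whose entries are the code-cache addresses of the translated handlers. The sketch below is a model under assumptions, not the BTKernel implementation: the `Translator` class, the cache base address, the fixed block size, and the choice to translate handlers eagerly are all hypothetical.

```python
class Translator:
    """Toy translator that hands out code-cache addresses for native handlers."""

    def __init__(self, native_idt):
        self.native_idt = dict(native_idt)   # vector -> native handler address
        self.code_cache = {}                 # native address -> code-cache address
        self.next_cache_addr = 0x9000_0000   # hypothetical code-cache base

    def translate(self, native_addr):
        """Return the code-cache address for native_addr, translating on first use."""
        if native_addr not in self.code_cache:
            self.code_cache[native_addr] = self.next_cache_addr
            self.next_cache_addr += 0x100    # pretend every block occupies 0x100 bytes
        return self.code_cache[native_addr]

    def shadow_idt(self):
        """Build a shadow IDT whose entries point straight into the code cache."""
        return {vec: self.translate(addr) for vec, addr in self.native_idt.items()}


translator = Translator({14: 0xC010_0000, 32: 0xC010_0400})
shadow = translator.shadow_idt()
```

With such a table installed, an interrupt lands directly on translated code, so no dispatcher code runs on the entry path.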

Page 25:

Correctness Concerns

1. Read / Write of the interrupted PC address on stack will return incorrect values.
  • Fortunately, this is rare in practice and can be handled specially

Page 26:

Read of an interrupted PC address

[Diagram: guest stack with SP pointing at an interrupt frame holding CS register, translated PC, and Flags; a "load addr" instruction reads the translated PC slot.]

Examples:

1. Exception Tables in Linux page fault handler

Page 27:

Exception Tables in Linux

• Page faults are allowed in certain functions
  • e.g., copy_from_user(), copy_to_user().
• An exception table is constructed at compile time
  • contains the range of PC addresses that are allowed to page fault.
• At runtime, the faulting PC value is compared against the exception table
  • Panic only if PC not present in exception table
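The runtime check described above can be sketched as a sorted-table lookup. This is a simplified model with made-up addresses: the table is reduced to a sorted list of PCs that are allowed to fault, and `may_fault` and `on_kernel_page_fault` are hypothetical names rather than kernel functions.

```python
from bisect import bisect_left

# Hypothetical exception table: sorted PCs that are allowed to page fault.
EXCEPTION_TABLE = [0xC010_1000, 0xC010_1010, 0xC010_1024]


def may_fault(table, pc):
    """Binary-search the sorted table for the faulting PC."""
    i = bisect_left(table, pc)
    return i < len(table) and table[i] == pc


def on_kernel_page_fault(table, faulting_pc):
    """Recover if the PC is listed in the table, otherwise model a kernel panic."""
    if not may_fault(table, faulting_pc):
        raise RuntimeError(f"kernel panic: unexpected page fault at {faulting_pc:#x}")
    return "recovered"


result = on_kernel_page_fault(EXCEPTION_TABLE, 0xC010_1010)
```

Because the decision depends on the value of the faulting PC, the handler is one of the few places where the kernel actually reads the PC saved on the stack.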

Page 28:

Read of an Interrupted PC address

[Diagram: guest stack with SP pointing at an interrupt frame holding CS register, translated PC, and Flags; a "load addr" instruction reads the translated PC slot.]

Problem: The faulting PC value is now a code-cache address.

Solution: Dispatcher adds potentially faulting code cache addresses to the exception table
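The solution can be sketched by extending a toy exception table at translation time. The table is modeled as a sorted list of PCs that may fault, and the hook `on_translate` is a hypothetical stand-in for whatever the dispatcher does when it emits a translated instruction; all addresses are made up.

```python
from bisect import bisect_left, insort


def may_fault(table, pc):
    """Binary-search the sorted table for pc."""
    i = bisect_left(table, pc)
    return i < len(table) and table[i] == pc


def on_translate(table, native_pc, cache_pc):
    """Hypothetical dispatcher hook, called when native_pc is translated to cache_pc.

    If the native instruction is allowed to fault, its code-cache twin is
    added to the table so that a fault at cache_pc is also accepted.
    """
    if may_fault(table, native_pc) and not may_fault(table, cache_pc):
        insort(table, cache_pc)


table = [0xC010_1000, 0xC010_1010]
before = may_fault(table, 0x9000_0040)
on_translate(table, 0xC010_1010, 0x9000_0040)   # native PC is in the table
on_translate(table, 0xC010_2000, 0x9000_0080)   # native PC is not in the table
```

After the two calls, a fault at 0x9000_0040 passes the check while a fault at 0x9000_0080 still does not, so the kernel's handler can run unmodified on code-cache PC values.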

Page 29:

Read of an Interrupted PC address

[Diagram: guest stack with SP pointing at an interrupt frame holding CS register, translated PC, and Flags; a "load addr" instruction reads the translated PC slot.]

Examples:

1. Exception Tables in Linux

2. MS Windows NT Structured Exception Handling: __try / __except constructs in C/C++

Page 30:

__try / __except blocks in MS Windows NT

Syntax:
__try { <potentially faulting code> } __except (<filter expression>) { <fault handler> }

Example Usage:
__try { copy_from_user(); } __except (EXCEPTION_EXECUTE_HANDLER) { signal_process(); }

Also implemented using exception tables in the Windows kernel

Page 31:

More examples in the paper.

In our experience, all such cases can be nicely handled!

Page 32:

Correctness Concerns

1. Read / Write of the faulting PC address on stack will return incorrect values.

2. Code-cache addresses will now live in kernel stacks.
  • What if code-cache addresses become invalid?

Page 33:

Code Cache Addresses can now live in Kernel Data Structures

[Diagram: Thread 1 Stack and Thread 2 Stack, each with its own SP, separated by a Context Switch. A saved interrupt frame (CS register, translated PC, Flags) on a thread stack holds a translated PC that points into the Code Cache.]

Page 34:

Code Cache Addresses can now live in Kernel Data Structures

• Disallow Cache Replacement
  • Code Cache of around 10MB suffices for Linux
• Do not move or modify code cache blocks once they are created
  • Ensures that a code cache address remains valid for the execution lifetime
• If the code cache gets full, switch the translator off and back on
  • Switchoff is implemented by reverting to the original IDT and other entry points.
  • This effectively flushes the code cache and starts afresh
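The no-replacement policy can be sketched as an append-only allocator that is flushed wholesale when it fills up. The sketch below is illustrative only: the `CodeCache` class, its byte-offset addressing, and the tiny capacity are hypothetical, and the translator reboot is reduced to clearing the cache.

```python
class CodeCache:
    """Append-only code cache: blocks are never moved, modified, or individually evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.blocks = {}   # native PC -> (offset, size)
        self.reboots = 0

    def reboot(self):
        """Model switchoff followed by switchon: flush everything and start afresh."""
        self.blocks.clear()
        self.used = 0
        self.reboots += 1

    def lookup_or_translate(self, native_pc, size):
        """Return the cache offset for native_pc, allocating space on first use."""
        if native_pc in self.blocks:
            return self.blocks[native_pc][0]   # address stays valid once handed out
        if size > self.capacity:
            raise ValueError("block larger than the whole code cache")
        if self.used + size > self.capacity:
            self.reboot()                      # cache full: no replacement, just restart
        offset = self.used
        self.blocks[native_pc] = (offset, size)
        self.used += size
        return offset


cache = CodeCache(capacity=100)
first = cache.lookup_or_translate(0x1000, 60)
again = cache.lookup_or_translate(0x1000, 60)    # same address as before
second = cache.lookup_or_translate(0x2000, 60)   # does not fit: triggers a reboot
```

Between reboots every address the cache has handed out stays valid, which is what allows code-cache addresses to sit in kernel stacks.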

Page 35:

Dynamic Switchon / Switchoff

• Replace all entry points with shadow / original values
  • e.g., for switchoff, replace shadow interrupt descriptor table with original
• Iterate over the kernel’s list of threads
  • Identify PC values in thread stacks and convert them to code cache / native values
• Translator reboot (switchoff followed by switchon) flushes the code cache
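The two steps above can be sketched on a toy kernel model. Everything here is a stated assumption: the `kernel` dictionary, the pre-tagged stack slots (which sidestep the real problem of identifying PC values in a stack), and the precomputed address maps are hypothetical.

```python
def convert_stacks(threads, mapping):
    """Rewrite every saved PC on every thread stack through the given address map."""
    for stack in threads.values():
        for i, (kind, value) in enumerate(stack):
            if kind == "pc":
                stack[i] = (kind, mapping[value])


def switch_on(kernel, native_to_cache):
    kernel["active_idt"] = kernel["shadow_idt"]       # entry points -> shadow values
    convert_stacks(kernel["threads"], native_to_cache)


def switch_off(kernel, cache_to_native):
    kernel["active_idt"] = kernel["original_idt"]     # entry points -> original values
    convert_stacks(kernel["threads"], cache_to_native)


kernel = {
    "original_idt": {14: 0x1000},
    "shadow_idt": {14: 0x9000},
    "active_idt": {14: 0x1000},
    "threads": {
        "t1": [("data", 7), ("pc", 0x1004)],
        "t2": [("pc", 0x1010)],
    },
}
native_to_cache = {0x1004: 0x9008, 0x1010: 0x9020}
cache_to_native = {cache: native for native, cache in native_to_cache.items()}

switch_on(kernel, native_to_cache)
after_on = {name: list(stack) for name, stack in kernel["threads"].items()}
switch_off(kernel, cache_to_native)
```

A switchoff followed by a switchon returns every saved PC to a consistent form, so the translator can discard its code cache in between.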

Page 36:

Correctness Concerns

1. Read / Write of the faulting PC address on stack will return incorrect values.

2. Code-cache addresses will now live in kernel stacks. What if code-cache addresses become invalid?

3. Imprecise Interrupts and Exceptions.

Page 37:

Imprecise Exceptions and Interrupts

Interestingly, an OS kernel typically never depends on precise exceptions and interrupts.

Page 38:

Reentrancy and Concurrency

Direct entries into the code cache introduce new reentrancy and concurrency issues

Detailed discussion in the paper.

Page 39:

Optimizations that worked

• L1 cache-aware Code Cache Layout

• Function call/return optimization

Page 40:

Code Cache Layout for Direct Branch Chaining

[Diagram: Dispatcher, Code Cache, and Edge Cache; the edge cache holds the edge code for branching to the dispatcher.]

Edge code:
• executed only once, on the first execution of the block.
• However, shares the same cache lines as all other code.

Allocate edge code from a separate memory pool for better cache locality.
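The locality argument can be checked with a small cache-line count. The sketch assumes a 64-byte cache line and made-up block sizes (48 bytes of block body, 16 bytes of edge code); `layout` and `lines_touched` are hypothetical helpers, not part of any real translator.

```python
LINE = 64  # assumed L1 cache line size in bytes


def lines_touched(regions):
    """Return the set of cache-line indices covered by (start, size) regions."""
    lines = set()
    for start, size in regions:
        lines.update(range(start // LINE, (start + size - 1) // LINE + 1))
    return lines


def layout(blocks, separate_edge_pool):
    """Place (body_size, edge_size) blocks; return the (start, size) of each block body."""
    body_next = 0
    bodies = []
    for body_size, edge_size in blocks:
        bodies.append((body_next, body_size))
        body_next += body_size
        if not separate_edge_pool:
            body_next += edge_size   # edge code sits right after its block body
        # with a separate pool, edge code is allocated elsewhere and takes no space here
    return bodies


blocks = [(48, 16)] * 4
interleaved = len(lines_touched(layout(blocks, separate_edge_pool=False)))
separated = len(lines_touched(layout(blocks, separate_edge_pool=True)))
```

In this toy layout the four block bodies span 4 cache lines when edge code is interleaved and 3 when it is moved to its own pool, because the run-once edge code no longer pads out the frequently executed code.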

Page 41:

Function call/return optimization

Use identity translations for ‘call’ and ‘ret’ instructions instead of treating ‘ret’ as another indirect branch.

Involves careful handling of call instructions with indirect targets (discussed in the paper)
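The effect of the optimization on indirect-branch lookups can be sketched with a counter. The model rests on an assumption that should be checked against the paper: with identity translation, the translated ‘call’ pushes a code-cache return address, so the translated ‘ret’ can jump to it without any lookup. The function `run_calls` and the addresses are hypothetical.

```python
NATIVE_RET = 0x1005   # hypothetical native return address
CACHE_RET = 0x9005    # hypothetical code-cache address of the same return site
NATIVE_TO_CACHE = {NATIVE_RET: CACHE_RET}


def run_calls(n_calls, identity_callret):
    """Simulate n call/ret pairs and count indirect-branch lookups."""
    lookups = 0
    stack = []
    for _ in range(n_calls):
        # Translated 'call': push the return address.
        stack.append(CACHE_RET if identity_callret else NATIVE_RET)
        # Translated 'ret': pop the return address and find where to continue.
        target = stack.pop()
        if not identity_callret:
            target = NATIVE_TO_CACHE[target]   # 'ret' handled like any indirect branch
            lookups += 1
        if target != CACHE_RET:
            raise RuntimeError("control left the code cache")
    return lookups


without_opt = run_calls(1000, identity_callret=False)
with_opt = run_calls(1000, identity_callret=True)
```

In this model every return costs one lookup without the optimization and none with it, at the price of leaving code-cache addresses on the stack as return addresses.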

Page 42:

Experiments

• BTKernel Performance vs. Native

• BTKernel Statistics

• Experience with some applications

Page 43:

Apache 1, 2, 4, 8 and 12 processors

[Chart: throughput in MBps (y-axis, 0 to 14000) against number of processors (1, 2, 4, 8, 12) for Native, BTKernel, and BTKernel-no-callret. Higher is better.]

Page 44:

Fileserver 1, 4, 8, 12 processors

[Chart: throughput in ops/s (y-axis, 0 to 40000) against number of processors (1, 4, 8, 12) for Native, BTKernel, and BTKernel-no-callret. Higher is better.]

Page 45:

lmbench fork operations

[Chart: time in microseconds (y-axis, 0 to 1400) for the lmbench microbenchmarks execve, exit, and sh, comparing Native, BTKernel, and BTKernel-no-callret. Lower is better.]

Page 46:

Number of Dispatcher Exits

               Without call/ret optimization      With call/ret optimization
               Instructions   Dispatcher Exits    Instructions   Dispatcher Exits
Apache         56 b           7 m                 59 b           125
Linux Build    570 b          72 m                590 b          33059

m = million, b = billion

Page 47:

Applications

• We implemented Shadow Memory for a Linux guest
  • Identifies the CPU-private (read/write) and CPU-shared (read/write) bytes in kernel address space
• Overheads range from 20% to 300%
• Significant improvement over the 10x overheads reported in DRK
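The kind of bookkeeping such a shadow memory performs can be sketched per byte. This is an illustrative model rather than the authors' tool: the `ShadowMemory` class, its `on_access` hook, and the labels returned by `classify` are hypothetical.

```python
from collections import defaultdict


class ShadowMemory:
    """Tracks, for every byte, which CPUs have read it and which have written it."""

    def __init__(self):
        self.readers = defaultdict(set)   # byte address -> CPUs that read it
        self.writers = defaultdict(set)   # byte address -> CPUs that wrote it

    def on_access(self, cpu, addr, size, is_write):
        """Instrumentation hook: record one memory access of `size` bytes."""
        table = self.writers if is_write else self.readers
        for byte in range(addr, addr + size):
            table[byte].add(cpu)

    def classify(self, addr):
        """Return (sharing, mode) for a byte, or None if it was never accessed."""
        readers = self.readers.get(addr, set())
        writers = self.writers.get(addr, set())
        cpus = readers | writers
        if not cpus:
            return None
        sharing = "CPU-private" if len(cpus) == 1 else "CPU-shared"
        mode = "write" if writers else "read"
        return sharing, mode


shadow = ShadowMemory()
shadow.on_access(cpu=0, addr=0x1000, size=4, is_write=False)
shadow.on_access(cpu=1, addr=0x1002, size=2, is_write=True)
```

A translator would call a hook like `on_access` from the instrumented form of every kernel load and store, so the cost of the hook is added on top of the translator's own overhead.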

Page 48:

Summary and Conclusion

• Avoid back-and-forth translation between native and translated values of interrupted PC
• Relax precision requirements on exceptions and interrupts
• Use cache-aware layout for the code cache
• Use identity translations for the function call/ret instructions

Near-native performance DBT implementation for unmodified Linux

Availability: https://github.com/piyus/btkernel

Page 49:

Thank You.