
Multiverse: Automatic Hybridization of Runtime Systems Kyle C. Hale, Conor Hetland, and Peter Dinda | {kh, ch}@u.northwestern.edu, [email protected]

A Hybrid Runtime (HRT) is a transformation of a traditional parallel runtime into a specialized operating system kernel. HRTs enjoy unfettered access to the hardware and define their own abstractions over that hardware. The Hybrid Virtual Machine (HVM) makes it possible to create VMs that are internally partitioned between a “regular OS” (ROS) and an HRT. The HVM allows the HRT to leverage legacy functionality inside the ROS, and it allows a user to easily create and launch HRTs from the ROS.

Hybrid Runtimes

•  HRTs can be very fast, but they require a manual port to kernel mode. This requires domain knowledge at both the runtime-developer and the kernel-developer level.

•  Even for an experienced kernel developer, porting a complex parallel runtime to kernel mode is an error-prone process. Porting can be difficult and laborious!

•  Much of the legacy functionality a runtime relies on is not on the critical path!

Why Automatic Hybridization?

[1] K. Hale, C. Hetland, and P. Dinda. Automatic Hybridization of Runtime Systems. In Proceedings of the 25th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC ’16).

[2] K. Hale and P. Dinda. Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support. In Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’16).

[3] K. Hale and P. Dinda. A Case for Transforming Parallel Runtimes into Operating System Kernels. In Proceedings of the 24th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC ’15).

[4] J. Lange, P. Dinda, K. Hale, and L. Xia. An Introduction to the Palacios Virtual Machine Monitor Release 1.3. Tech. Rep. NWU-EECS-11-10, Dept. of EECS, Northwestern Univ. (2011).

References

Acknowledgements This project is made possible by support from the United States National Science Foundation through grant CCF-1533560 and from Sandia National Laboratories through the Hobbes Project, which is funded by the 2013 Exascale Operating and Runtime Systems Program under the Office of Advanced Scientific Computing Research in the United States Department of Energy’s Office of Science.

•  Racket is the most widely used dialect of Scheme.

•  It includes challenging features typical of a dynamic, high-level language. Many of these features make heavy use of the Linux ABI: system calls, memory mapping, processes, threads, signals, etc.

•  We automatically hybridize Racket with Multiverse. The user can interact with the Racket REPL in the standard fashion.

Small Performance Overhead

•  With Multiverse, when a new HRT context is created, the Aerokernel is booted transparently on a remote set of cores.

•  Right: The Aerokernel binary is included in the runtime’s executable when compiled with our toolchain (see the embedding sketch after this list).

•  The boot initialization is requested by the Multiverse runtime layer on the ROS side.
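To make the embedding step concrete, the sketch below shows one common way a kernel image can be bundled into an application executable: convert the image to an object file with GNU objcopy and reference the generated _binary_*_start/_end symbols from C. The file name nautilus.bin and the multiverse_boot_hrt() hook are hypothetical illustrations, not the actual Multiverse toolchain interface.

```c
/* Hedged sketch: one way to embed an Aerokernel image in the runtime's
 * executable. Build step (assumed file name):
 *
 *   objcopy -I binary -O elf64-x86-64 -B i386:x86-64 nautilus.bin nautilus_blob.o
 *   cc runtime.o nautilus_blob.o -o runtime
 *
 * objcopy then provides these symbols for the embedded blob: */
#include <stddef.h>
#include <stdio.h>

extern const char _binary_nautilus_bin_start[];
extern const char _binary_nautilus_bin_end[];

/* Hypothetical hook: the ROS-side Multiverse layer would hand the image to
 * the VMM/HVM so it can be booted on the HRT cores. */
static int multiverse_boot_hrt(const void *image, size_t len)
{
    (void)image;
    printf("requesting boot of %zu-byte Aerokernel image\n", len);
    return 0;
}

int main(void)
{
    size_t len = (size_t)(_binary_nautilus_bin_end - _binary_nautilus_bin_start);
    return multiverse_boot_hrt(_binary_nautilus_bin_start, len);
}
```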

Reducing Forwarded Events

•  We introduced Multiverse, a system that automatically hybridizes existing runtime systems.

•  Runtime developers rebuild their system with our toolchain. It can then operate in a state of split execution, where most of the execution occurs in an accelerated HRT environment.

•  Multiverse adds little to no overhead, allowing the developer to start with a working system in kernel mode. The developer can then incrementally port legacy functionality to the HRT, reducing the number of events forwarded to the ROS.

Summary

split-execution in Multiverse

merged address space

•  The merged address space allows the HRT to leverage code and data mapped into the ROS virtual address space.

•  We can, for example, use shared user-space libraries in the HRT that are mapped into the ROS process without implementing dynamic linking functionality in the Aerokernel.

•  The HRT can operate on data structures that have been constructed in the ROS.

•  Higher-half addresses (where kernel code and data are mapped) are distinct for the ROS and the HRT (see the address-classification sketch after this list).
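A minimal sketch, based on the x86-64 canonical split shown in the merged-address-space figure, of how code might distinguish shared lower-half addresses from private higher-half ones. The helper names are illustrative, not a Multiverse API.

```c
/* Minimal sketch based on the canonical address split in the merged-address-
 * space figure: the lower half (user code, data, libs, heap, stack) is shared
 * between ROS and HRT, while each side keeps its own higher half (kernel
 * mappings). Helper names are illustrative, not a Multiverse API. */
#include <stdbool.h>
#include <stdint.h>

#define CANONICAL_LOWER_MAX  0x00007fffffffffffULL  /* top of shared lower half */
#define CANONICAL_HIGHER_MIN 0xffff800000000000ULL  /* start of per-side higher half */

/* True if a pointer lies in the lower half, i.e. the region that the ROS
 * process and the HRT map identically and can safely exchange. */
static bool addr_is_shared(const void *p)
{
    return (uintptr_t)p <= CANONICAL_LOWER_MAX;
}

/* True if a pointer refers to higher-half (kernel) mappings, which are
 * distinct in the ROS and the HRT and must not be passed across the boundary. */
static bool addr_is_private(const void *p)
{
    return (uintptr_t)p >= CANONICAL_HIGHER_MIN;
}
```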

Left: Performance of hybridized Racket (with Multiverse) for a set of benchmarks from the Language Benchmark Game, compared to Racket in a VM (Virtual) and Racket running on native Linux (Native). Overheads are very small. In all but two cases, the light-weight environment provided by the HRT actually increases performance over the Virtual configuration.

Right: The primary source of overhead in Multiverse is forwarded events. The two benchmarks above that are slowed by this overhead incur most of it from page faults. The figure on the right shows that a typical forwarded event costs roughly 1500 cycles.
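For context on how per-event costs like the ~1500-cycle figure can be obtained, the sketch below times mmap and munmap calls in cycles using the TSC. This is a generic measurement sketch, not the experiment from the poster; the iteration count and mapping size are arbitrary.

```c
/* Hedged sketch: measuring per-call mmap/munmap latency in cycles with the
 * TSC, the kind of per-event cost shown in the figure. The iteration count
 * and mapping size are arbitrary choices for illustration. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ITERS 1000
#define LEN   (1UL << 20)   /* 1 MiB per mapping */

int main(void)
{
    uint64_t mmap_cycles = 0, munmap_cycles = 0;

    for (int i = 0; i < ITERS; i++) {
        uint64_t t0 = __rdtsc();
        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uint64_t t1 = __rdtsc();
        if (p == MAP_FAILED)
            return 1;
        munmap(p, LEN);
        uint64_t t2 = __rdtsc();

        mmap_cycles   += t1 - t0;
        munmap_cycles += t2 - t1;
    }

    printf("mmap:   %" PRIu64 " cycles/call\n", mmap_cycles / ITERS);
    printf("munmap: %" PRIu64 " cycles/call\n", munmap_cycles / ITERS);
    return 0;
}
```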

•  These bars show the benchmarks that initially perform worse with Multiverse.

•  We introduce a change to the Racket runtime that eagerly faults in pages when mapping large chunks of memory (a sketch of this prefaulting technique follows this list).

•  This reduces the occurrence of page faults, which in turn reduces the number of events forwarded from the HRT to the ROS.

•  Performance of the hybridized version of Racket is now better than the Virtual configuration.

•  The point of this exercise is to show that Multiverse’s overheads can be eliminated by reducing the number of forwarded events.
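The sketch below illustrates the prefaulting technique generically; it is not the actual Racket patch. The idea is to ask the kernel to populate a large anonymous mapping up front, with a per-page touch as a fallback, so page faults happen once at allocation time rather than being forwarded later from the HRT.

```c
/* Hedged sketch of eager prefaulting when mapping a large region (a generic
 * illustration of the technique, not the actual Racket patch). Either ask
 * the kernel to populate the mapping up front (MAP_POPULATE), or touch one
 * byte per page so all faults happen at allocation time instead of trickling
 * in later as forwarded page-fault events. */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *alloc_prefaulted(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Belt-and-suspenders fallback: write one byte per page so every page
     * table entry is present before the runtime starts using the region. */
    long page = sysconf(_SC_PAGESIZE);
    volatile char *c = p;
    for (size_t off = 0; off < len; off += (size_t)page)
        c[off] = 0;

    return p;
}
```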

[Figure: split execution in Multiverse. The ROS side runs a general-purpose OS (Linux) hosting the parallel app, parallel runtime, and HVM library; the HRT side runs the same app and runtime over the HVM library and the AeroKernel (Nautilus). Both sides share a merged address space, with the VMM beneath the accelerated HRT. A companion diagram of an execution group shows the main thread and its ROS partner thread interacting with an HRT thread and a nested HRT thread.]

[Figure: merged address space. The ROS virtual address space, HRT virtual address space, and physical address space are shown side by side. The canonical “lower half” (0x0000000000000000–0x00007fffffffffff), containing the application and runtime code and data, libs, .text, .data, heap, and stack of the control process (ROS core), is shared between ROS and HRT. The canonical “higher half” (0xffff800000000000–0xffffffffffffffff) is private to each side: the ROS kernel (Linux) on one, and the loaded AeroKernel, AeroKernel-managed memory, and Multiboot/VMM info structures (HRT core physical memory) on the other.]

[Figure: latency in cycles of mmap and munmap under the Virtual and Multiverse configurations; the y-axis runs from 0 to 5000 cycles.]

[Figure: runtime in seconds (0–50 s) of the Racket benchmarks fannkuch-redux, binary-tree-2, fasta, fasta-3, nbody, spectral-norm, and mandelbrot-2 under the Native, Virtual, and Multiverse configurations.]

[Figure: runtime in seconds (0–60 s) of nbody and spectral-norm with prefaulting, under the Native-prefault, Virtual-prefault, and Multiverse-prefault configurations.]

•  In Multiverse, the runtime begins execution in the ROS. The runtime creates an HRT context through either explicit or implicit invocations.

•  Once an HRT context is created, the system is in a state of split execution.

•  During split execution, exceptional events on the HRT side (page faults, system calls, and some others) are forwarded to the ROS.

•  Each HRT execution context is paired with a partner thread on the ROS, which handles events forwarded over event channels. HRT contexts and their partner threads together comprise execution groups (a sketch of such a partner-thread loop follows this list).

•  Nested threads share event channels with their parent.

•  HRT contexts are (by default) created whenever a new pthread is created in the runtime.
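As an illustration of the partner-thread protocol, the sketch below shows a ROS-side loop that waits on an event channel, services forwarded page faults and system calls, and signals completion back to the HRT. The event record and channel functions (hrt_event, channel_wait, channel_complete) are hypothetical names for illustration, not the Multiverse implementation.

```c
/* Hedged sketch of a ROS-side partner thread servicing events forwarded from
 * its HRT context. The event record and channel primitives are hypothetical;
 * the real transport would be provided by the Multiverse/HVM layer. */
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>

enum hrt_event_type { HRT_EVENT_PAGE_FAULT, HRT_EVENT_SYSCALL };

struct hrt_event {                 /* hypothetical forwarded-event record */
    enum hrt_event_type type;
    uint64_t fault_addr;           /* for page faults */
    long     syscall_nr;           /* for system calls */
    long     args[6];
    long     result;
};

/* Hypothetical event-channel primitives shared by an execution group. */
struct event_channel;
int  channel_wait(struct event_channel *ch, struct hrt_event *ev);
void channel_complete(struct event_channel *ch, struct hrt_event *ev);

/* Partner-thread loop: wait for a forwarded event, resolve it in the ROS
 * (where Linux has the state to handle it), then signal completion. */
void *partner_thread(void *arg)
{
    struct event_channel *ch = arg;
    struct hrt_event ev;

    while (channel_wait(ch, &ev) == 0) {
        switch (ev.type) {
        case HRT_EVENT_PAGE_FAULT:
            /* Touch the faulting address in the shared lower half so Linux
             * installs a mapping that the HRT can then reuse. */
            (void)*(volatile char *)(uintptr_t)ev.fault_addr;
            ev.result = 0;
            break;
        case HRT_EVENT_SYSCALL:
            /* Execute the system call natively on the ROS side. */
            ev.result = syscall(ev.syscall_nr, ev.args[0], ev.args[1],
                                ev.args[2], ev.args[3], ev.args[4], ev.args[5]);
            break;
        }
        channel_complete(ch, &ev);
    }
    return NULL;
}
```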

[Figure: (a) Current model: the parallel app and parallel runtime run in user mode over a general kernel on the node hardware. (b) Hybrid Runtime model: the parallel app runs over a Hybrid Runtime (HRT) in kernel mode directly on the node hardware. (c) Hybrid Runtime model within a Hybrid Virtual Machine (HVM): a specialized virtualization model provides the performance path for the parallel app and HRT, while a general virtualization model hosting the traditional user/kernel stack provides the legacy path.]

•  We showed in previous work that porting a legacy parallel runtime to an HRT environment can increase the performance of a real parallel runtime system by as much as 40% [2, 3].

•  The HRT is composed of the runtime and a thin kernel framework layer called an Aerokernel.

•  Aerokernels are designed to be simple, light-weight, and very fast. We designed and implemented the Nautilus Aerokernel, which is used in conjunction with Multiverse (the sketch below illustrates the kind of direct hardware access this kernel-mode environment makes possible).
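Because the runtime and the Aerokernel share kernel mode, the runtime can use privileged hardware features directly instead of going through an OS. The snippet below is a generic illustration of such operations, not Nautilus code: disabling and enabling local interrupts and reading a model-specific register with inline assembly.

```c
/* Generic illustration (not Nautilus code) of what running in kernel mode
 * buys the runtime: privileged x86 instructions can be used directly, with
 * no system call or OS policy in the way. These would fault (#GP) if
 * executed in user mode under a general-purpose OS. */
#include <stdint.h>

/* Disable/enable local interrupts around a critical section. */
static inline void irq_disable(void) { __asm__ volatile("cli" ::: "memory"); }
static inline void irq_enable(void)  { __asm__ volatile("sti" ::: "memory"); }

/* Read a model-specific register, e.g. IA32_APIC_BASE (MSR 0x1B). */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}
```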

[Figure annotations: overhead added for forwarded system calls (~1500 cycles); interactions within an execution group; the Aerokernel boot process; fast path on hardware vs. fast path under virtualization.]

[Figure: building an Aerokernel to support a parallel runtime system (a manual port to HRT) is an iterative loop: add a function, rebuild, boot, and test whether it works; if not, repeat; if so, done. We need an easier way to go from a legacy runtime system to an HRT+HVM-capable runtime.]

Nautilus is available at http://nautilus.halek.co