NReduce: A Distributed Virtual Machine for Parallel Graph Reduction Peter Kelly Paul Coddington Andrew Wendelborn Distributed and High Performance Computing.

NReduce: A Distributed Virtual Machine for Parallel Graph Reduction

Peter Kelly

Paul Coddington

Andrew Wendelborn

Distributed and High Performance Computing Group

School of Computer Science

The University of Adelaide

Introduction

• Distributed computing middleware
  – Utilise a set of machines to execute a collection of jobs
  – Can be based on desktops, clusters, or the Internet
• Job organisation is usually either
  – Independent (task farming)
  – Dependency graph (workflow)
• Advantages of workflow languages
  – Implicit parallelism
  – High-level programming model that hides execution details
  – Easy access to many different services
• Disadvantages of workflow languages
  – Limited feature set: poor support for control structures, data manipulation, and other forms of computation
  – As a result, workflow developers sometimes have to create additional services or components in other languages to do relatively simple things (e.g. shims for data conversion)

Workflow execution

[Figure: a workflow graph executed across a set of compute nodes; successive frames mark which compute nodes are active as execution proceeds.]

Related work

• Job scheduling (native programs)
  – Condor, Xgrid, Sun Grid Engine, Chimera
• Java/.NET-based
  – OptimalGrid, Alchemi, PAGIS
• Workflow/web service composition
  – Triana, Taverna, Kepler, BPEL, WSIPL
• Parallel functional programming
  – (v,G)-Machine, GUM, GHC+SMP
• Message passing libraries & languages
  – MPI, PVM, Erlang

Problem statement & approach

• Want to place more of the application’s logic in the workflow, rather than other services
• Useful for several types of programming constructs
  – Data structure access + manipulation
  – Functional operators: map, filter, reduction
  – Complex application logic
• Solution - NReduce, our distributed virtual machine
  – Implements a simple, Turing complete language for specifying workflows and computation
  – Based on existing techniques from the parallel functional programming community

Execution model

• Workflows often use data flow graphs to specify dependencies
• NReduce uses graph reduction, a similar model
• Well known technique for implementing functional languages
  – Also based on data dependencies
  – Supports higher order functions, lazy evaluation, parallelism
  – Can be efficiently compiled to native code using existing techniques

[Figure: graph for the expression f(g(x),h(y)). Legend: application node (@), function, data value.]

[Figure: graph reduction of f(g(x),h(y)), shown as a sequence of frames in which each reducible subgraph is replaced by its result.]

f(g(x),h(y)) → f(r1,h(y)) → f(r1,r2) → r3
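The reduction of f(g(x),h(y)) down to r3 can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not NReduce's implementation: the class names `Fn` and `App`, the helper `traced`, and the concrete definitions of f, g and h are all assumptions made for the example. Each application node is overwritten with its result once reduced, so a shared subgraph is only ever reduced once.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Fn:
    """A function leaf; `args` holds the arguments collected so far."""
    name: str
    arity: int
    impl: Callable[..., Any]
    args: Tuple[Any, ...] = ()


class App:
    """An application node (@) applying `fun` to `arg`."""

    def __init__(self, fun: Any, arg: Any) -> None:
        self.fun = fun
        self.arg = arg
        self.done = False   # True once this node has been overwritten
        self.value = None   # the result that replaced the application


def reduce_node(node: Any) -> Any:
    """Reduce a node to a data value or a partially applied function."""
    if not isinstance(node, App):
        return node                       # functions and data values need no work
    if node.done:
        return node.value                 # shared subgraph: reuse the earlier result
    fn = reduce_node(node.fun)
    if not isinstance(fn, Fn):
        raise TypeError("cannot apply a data value")
    args = fn.args + (node.arg,)
    if len(args) < fn.arity:
        result = replace(fn, args=args)   # not enough arguments yet
    else:
        result = fn.impl(*(reduce_node(a) for a in args))
    node.value, node.done = result, True  # update the node in place
    return result


calls = []  # order in which functions actually fire


def traced(name: str, arity: int, impl: Callable[..., Any]) -> Fn:
    """Hypothetical helper: wrap `impl` so each call is recorded in `calls`."""
    def wrapper(*args: Any) -> Any:
        calls.append(name)
        return impl(*args)
    return Fn(name, arity, wrapper)


# Assumed definitions, chosen only so the example has concrete values.
f = traced("f", 2, lambda a, b: a + b)
g = traced("g", 1, lambda a: a + 1)
h = traced("h", 1, lambda a: a * 2)
x, y = 3, 4

# f(g(x), h(y)) as a graph: @(@(f, @(g, x)), @(h, y))
graph = App(App(f, App(g, x)), App(h, y))
r3 = reduce_node(graph)
```

With these assumed definitions, g(x) and h(y) fire before f, mirroring the r1, r2, r3 order of the slides. Unlike a lazy implementation, this sketch reduces every argument as soon as a function has all of them, which is only appropriate for strict built-in functions.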

Distributed graph reduction

[Figure: distributed reduction of f(g(x),h(y)), where x=A(...) and y=B(...) are calls to Service A and Service B, shown as a sequence of frames.]

– The calls to Service A and Service B are both active at the same time
– The services return x=r1 and y=r2
– g(r1) and h(r2) reduce to r3 and r4, leaving f(r3,r4)
– f(r3,r4) reduces to r5
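The key point of the distributed example is that the calls to Service A and Service B are in flight at the same time. The sketch below imitates that with Python threads rather than a distributed heap, so it is an analogy and not how NReduce works internally. The functions `service_a`, `service_b`, f, g and h are hypothetical stand-ins; the barrier exists only to demonstrate that both calls really do overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each service waits here until the other has also started. If the two
# calls ran one after the other, the wait would time out with an error.
both_active = threading.Barrier(2, timeout=5)


def service_a() -> int:
    both_active.wait()
    return 3            # plays the role of r1


def service_b() -> int:
    both_active.wait()
    return 4            # plays the role of r2


def g(value: int) -> int:
    return value + 1


def h(value: int) -> int:
    return value * 2


def f(left: int, right: int) -> int:
    return left + right


with ThreadPoolExecutor(max_workers=2) as pool:
    x = pool.submit(service_a)   # x = A(...)
    y = pool.submit(service_b)   # y = B(...)
    r3 = g(x.result())           # blocks only until Service A has answered
    r4 = h(y.result())
    r5 = f(r3, r4)
```

In NReduce this overlap comes from the evaluation model itself; the explicit thread pool is needed here only because Python has no equivalent built in.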

Nodes and tasks

• Distributed virtual machine consists of multiple nodes
  – Organised as a P2P network, with no central point of control
  – Nodes may be running different operating systems
• Each node contains multiple threads
  – Task threads (perform graph reduction)
  – I/O thread, garbage collector, etc.
• Threads communicate using a message passing layer
  – Provides asynchronous, one-way messaging
  – Similar to MPI but supports socket connections, dynamic node joins/departures
• A process is a group of cooperating task threads
• Distributed heap maintained across tasks
• Each task reduces its own graph segment

[Figure: Nodes and tasks. A process made up of three tasks, each shown alongside an I/O thread.]

Input language - ELC

• Very simple functional programming language
• Extended version of lambda calculus, with:
  – Arithmetic and conditional operations
  – Cons lists, strings
  – Letrec expressions and top-level functions
• Other functionality
  – Files & network connections - exposed as lists
  – Efficient list storage, based on arrays
• Intended as an intermediate language to be targeted by other compilers, e.g.:
  – XPath/XSLT/XQuery (currently investigating)
  – Potentially other workflow languages

Execution

• Abstract instruction set based on the (v,G)-machine (Augustsson & Johnsson)
• Hides parallelism + distribution from programmer
• Bytecode interpreter
  – Similar to traditional interpreter, but based on call graph instead of call stack (for concurrency)
• Native code engine
  – JIT compilation on process startup; targets x86
  – Significantly faster than interpreter (though not as fast as C)
• Runtime support
  – Distributed garbage collector
  – Message handler
  – Built-in functions

Frame management

• Frame = function call
  – Like stack frame/function activation record
  – Organised as a graph; may be >1 active at a time
• Frames are scheduled like processes in an OS
• Each frame is in one of four states
  – Running: candidate for execution by the processor; scheduled non-preemptively
  – Blocked: waiting on an I/O request or result of another call
  – New: a function call that has not yet begun evaluation (“suspended evaluation” or “future”)
  – Sparked: a function call that has not yet started, but is known to be needed in future; may be migrated to other machines for load balancing purposes
• Advantage: When one function call blocks, others may continue; no need to manually create threads
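The four frame states and the non-preemptive scheduling described above can be sketched with Python generators standing in for frames. This is an illustrative model under stated assumptions, not NReduce's scheduler: the names `State`, `Frame`, `run` and `constant` are invented for the example, a frame blocks by yielding the frame whose result it needs, and finished frames are tracked with a separate `done` flag because the slides list no state for them.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    NEW = auto()       # call that has not yet begun evaluation
    SPARKED = auto()   # not yet started, but known to be needed
    RUNNING = auto()   # candidate for execution
    BLOCKED = auto()   # waiting on the result of another call (or I/O)


class Frame:
    def __init__(self, name, body):
        self.name = name
        self.body = body       # generator function; `yield frame` waits for that frame
        self.state = State.NEW
        self.gen = None
        self.inbox = None      # value delivered when the frame is resumed
        self.done = False
        self.result = None
        self.waiters = []      # frames blocked on this one


def run(root):
    """Run `root` to completion; return its result and the order frames finished in."""
    runnable = deque()
    finished = []

    def activate(frame):
        if frame.state in (State.NEW, State.SPARKED):
            frame.state = State.RUNNING
            frame.gen = frame.body()
            runnable.append(frame)

    activate(root)
    while runnable:
        frame = runnable.popleft()
        try:
            # Non-preemptive: the frame runs until it blocks or returns.
            needed = frame.gen.send(frame.inbox)
        except StopIteration as stop:
            frame.done, frame.result = True, stop.value
            finished.append(frame.name)
            for waiter in frame.waiters:   # wake frames waiting on this result
                waiter.state, waiter.inbox = State.RUNNING, frame.result
                runnable.append(waiter)
            continue
        if needed.done:
            frame.inbox = needed.result
            runnable.append(frame)
        else:
            frame.state = State.BLOCKED
            needed.waiters.append(frame)
            activate(needed)
    return root.result, finished


def constant(name, value):
    """Hypothetical helper: a frame that returns `value` immediately."""
    def body():
        return value
        yield  # unreachable, but makes `body` a generator function
    return Frame(name, body)


a, b = constant("a", 1), constant("b", 2)


def root_body():
    left = yield a    # root blocks until frame a has a result
    right = yield b
    return left + right


result, finished = run(Frame("root", root_body))
```

Because blocking is just a state change on one frame, the scheduler is free to pick any other running frame in the meantime, which is the advantage the slide points out.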

Work distribution

• Dynamic load distribution
  – Much simpler and more practical than static scheduling
  – Caters for nodes with different processor speeds and other running tasks
  – Based on the approach of GUM (Trinder et al.)
• Work assigned to each task is based on the set of sparked frames available in the process
• Idle tasks send out work requests, asking to be given frames that they can start executing
  – When the request reaches a task with sparked frames, some of those frames are migrated to the requesting task
  – Graph pointers are updated with remote references
  – The idle task begins executing the new frames
  – Further work requests are postponed until the task is idle again
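The request/migrate cycle described above might look roughly like the following. It is a single-process sketch with no real messaging: `Task`, `request_work` and the `share` parameter are hypothetical names, frames are plain strings, and the rule for how many sparked frames to hand over per request is an assumption, since the slides only say "some".

```python
from collections import deque


class Task:
    def __init__(self, name, sparked=()):
        self.name = name
        self.sparked = deque(sparked)  # frames not yet started, eligible to migrate
        self.running = []              # frames this task is executing
        self.remote_refs = {}          # migrated frame -> name of the task now holding it

    @property
    def idle(self):
        return not self.running and not self.sparked


def request_work(idle_task, tasks, share=2):
    """Pass a work request along `tasks` until one of them has sparked frames.

    Returns the task that donated work, or None if nobody had any to give.
    """
    for donor in tasks:
        if donor is idle_task or not donor.sparked:
            continue                   # request moves on to the next task
        # Assumption: hand over 1/share of the sparked frames, at least one.
        count = max(1, len(donor.sparked) // share)
        for _ in range(count):
            frame = donor.sparked.pop()
            donor.remote_refs[frame] = idle_task.name  # pointer becomes a remote reference
            idle_task.running.append(frame)            # requester starts executing it
        return donor
    return None


# One busy task holding four sparked frames, and two idle tasks.
t0 = Task("t0", ["s1", "s2", "s3", "s4"])
t0.running.append("r1")
t1, t2 = Task("t1"), Task("t2")
tasks = [t0, t1, t2]

for task in (t1, t2):
    if task.idle:                      # only idle tasks send work requests
        request_work(task, tasks)
```

After both requests every task has something to run, and the busy task still holds one sparked frame in reserve for a later request.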

[Figure: Work distribution, shown as a sequence of frames. Three tasks start as Busy, Idle, Idle; the busy task holds frames marked R, B and S. Each idle task in turn sends a "request work" message, the busy task replies with "assign work", and the requesting task becomes Busy, until all three tasks are Busy.]

Parallelism

• NReduce uses strictness analysis to automatically detect parallelism
  – Many parallel functional languages instead rely on manual annotations, to avoid costs of excessive parallelism and sparking
  – We trade off a small amount of performance to gain automatic parallelism; acceptable for workflow languages which delegate most compute intensive work to services
  – Necessary for higher-level languages with no explicit support for parallelism (e.g. XSLT)
• Manual annotations also supported for certain cases
• Sparking is heavily optimised
  – One field assignment on spark, one field check on eval
• Side effect free programming model means no need for explicit synchronisation primitives

Streams

• Processes may establish TCP connections
  – Useful for accessing external services, e.g. web services
• Connections are exposed as data
  – Input stream: List of bytes, received from other machine
  – Output stream: List of bytes (generated by program), sent to other machine
• Parallel execution simplifies handling of multiple connections
  – When multiple function calls are active, each may block and unblock independently
  – Blocking reads or writes to a connection affect only the function call that uses it, not the entire process
  – Important for workflows that invoke multiple service operations in parallel
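Treating a connection as a list of bytes can be approximated in Python with a generator over a socket. This is an analogy rather than ELC code: `input_stream` and `output_stream` are hypothetical names, and a local `socket.socketpair()` stands in for a TCP connection to another machine so the example is self-contained.

```python
import socket
from typing import Iterable, Iterator


def input_stream(conn: socket.socket, chunk_size: int = 4096) -> Iterator[int]:
    """Expose the bytes received on `conn` as a lazily produced sequence."""
    while True:
        chunk = conn.recv(chunk_size)  # blocks only whoever is consuming this stream
        if not chunk:
            return                     # peer closed its side: end of the list
        yield from chunk               # iterating over bytes yields ints


def output_stream(conn: socket.socket, data: Iterable[int]) -> None:
    """Send a sequence of byte values, then signal end of stream."""
    conn.sendall(bytes(data))
    conn.shutdown(socket.SHUT_WR)


# A connected pair of local sockets plays the role of the two machines.
local, remote = socket.socketpair()
output_stream(local, (ord(ch) for ch in "hello"))
received = bytes(input_stream(remote))
local.close()
remote.close()
```

The consumer never calls `recv` directly; it simply iterates, and reading happens on demand, which is the property that lets independent function calls block on different connections without stalling each other.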

Performance: Sequential

                         nfib36    mandelbrot
nreduce (native)           87.7          43.8
nreduce (interpreted)     508.0         703.4
C                          10.3          18.3
Java                        5.1          12.2
Python                    435.5         981.2
Perl                      724.1         717.0

nreduce (native) relative to C: 8.5x (nfib36), 2.4x (mandelbrot)
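For readers unfamiliar with the benchmark, nfib is conventionally defined as a Fibonacci-style recursion that adds one per call, so its result equals the number of calls made. The sketch below uses that conventional definition; whether the benchmark code used for these measurements matches it exactly, and that "nfib36" means an argument of 36, are assumptions. The second half reproduces the "x C" figures from the native and C rows of the table.

```python
def nfib(n: int) -> int:
    """Conventional nfib: the result counts the calls made, including this one."""
    if n <= 1:
        return 1
    return 1 + nfib(n - 1) + nfib(n - 2)


# Figures copied from the table (nreduce native and C rows).
native = {"nfib36": 87.7, "mandelbrot": 43.8}
c_times = {"nfib36": 10.3, "mandelbrot": 18.3}

# Slowdown of native nreduce relative to C, rounded to one decimal place.
ratios = {name: round(native[name] / c_times[name], 1) for name in native}
```

Here `ratios` comes out as 8.5 for nfib36 and 2.4 for mandelbrot, matching the annotations on the slide.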

Performance: Parallel

Current & future work

• Performance evaluation
  – Currently testing the VM in a range of scenarios
  – Comparison with other functional & workflow languages
  – Use in context of XSLT + web service composition
• Optimisations
  – Garbage collection
  – Work distribution
• Fault tolerance
  – Handle node failure within a process by recomputing lost portions of the graph where possible
• Background utilisation
  – Suspend node & migrate work when running on workstation and user becomes active

Conclusion

• Our goal is to support complex workflow applications and parallel computation
• NReduce is a distributed virtual machine, which:
  – Implements a functional programming model
  – Supports transparent parallelism & distribution
  – Enables concurrent access to external services
  – Is based on a P2P model
  – Runs across multiple platforms
• A different approach to workflow engine construction
  – More like a traditional programming language implementation, but using ideas from distributed computing
  – Provides a powerful + flexible mechanism for writing distributed applications
