Options for Parallelizing CASPER for a Mesh Processor
Brad Clement, Tara Estlin, Ben Bornstein
Jet Propulsion Laboratory, California Institute of Technology
Copyright 2011 California Institute of Technology. Government sponsorship acknowledged.
Outline
• Project context
• CASPER
• Some example designs
• Parallelization design choices
• Our design
Improving Long-Range Rover Science Using Multi-Core Computing
Objectives:
• Develop and demonstrate key capabilities for rover traverse science using multi-core computing
• Adapt three autonomous science technologies to SOA multi-core system
  – rock finder (Rockster)
  – texture analysis
  – continual replanning (CASPER)
• Demonstrate with rover hardware and measure performance benefits using metrics such as execution time and data processed
CASPER – Continuous Activity Scheduling Planning Execution and Replanning
• CASPER uses a model of [spacecraft] activities to construct a [mission] plan to achieve [mission] goals while respecting [spacecraft operations] constraints
  – Example goals: science requests, downlink requests, maneuver requests
  – Example constraints: limited memory, power, propellant
• autonomously commanding EO-1 for the past 7 years
• automated sequence planning/generation for Orbital Express
• DSN resource allocation
• Modified Antarctic Mapping Mission (MAMM)
• 60+ models
[Diagram: CASPER architecture — the Schedule Database (SDB) holds activities, resource & state timelines, and conflicts. CASPER takes in goals; the Repairer applies schedule operations to resolve conflicts; the Commandable issues commands for execution and feeds state updates back into the SDB.]
CASPER Cycle
1. commandable updates timelines for state updates
2. propagate timelines and constraint networks
3. detail activities for new goals
4. check for flaws
5. repair/optimize for flaw
6. send commands to commandable for execution
7. repeat
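The cycle above reads as one continuous loop. A minimal sketch (class and method names are illustrative stand-ins, not CASPER's actual API):

```python
# Illustrative sketch of the CASPER cycle; names are not CASPER's real API.

class Scheduler:
    def __init__(self):
        self.flaws = []

    def apply_state_updates(self):   # 1. commandable updates timelines
        pass

    def propagate(self):             # 2. propagate timelines and constraint networks
        pass

    def detail_new_goals(self):      # 3. detail activities for new goals
        self.flaws.append("unexpanded goal")

    def find_flaws(self):            # 4. check for flaws
        return list(self.flaws)

    def repair(self, flaw):          # 5. repair/optimize for one flaw
        self.flaws.remove(flaw)

    def send_commands(self):         # 6. send commands to commandable
        pass

def run_cycles(n):
    s = Scheduler()
    for _ in range(n):               # 7. repeat (bounded here for the demo)
        s.apply_state_updates()
        s.propagate()
        s.detail_new_goals()
        flaws = s.find_flaws()
        if flaws:
            s.repair(flaws[0])
        s.send_commands()
    return len(s.flaws)
```

Each iteration repairs at most one flaw, which is why the replanner runs continuously rather than to completion.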
CASPER data → functions (bottlenecks)

• plans/schedules → plan and schedule: identify flaws, add/delete search states
• activities → add, delete, constrain, move, detail, abstract
  o parameters, possible values → get, set, choose value
  o parameter dependencies → evaluate dependency function, propagate values, check staleness
  o parameter/state constraints → find valid values, check violations, propagate constraints
  o reservations → apply to state/resource timelines
  o temporal constraints → add, remove
  o valid time intervals/orderings → compute valid time intervals/orderings
• state/resource vars (timelines) → compute valid time intervals, identify conflicts
  o values → compute, propagate, get contributing activities
• constraint rules → identify conflicts
  o conflicts → choose conflict, choose resolution method (e.g. move)
• preference/optimization criteria → compute scores, identify deficiencies
  o scores, deficiencies → choose preference to improve
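"Compute valid time intervals" is the bottleneck function revisited throughout these slides. A minimal sketch of the idea for a unit-capacity resource (the interval representation is an assumption, not CASPER's):

```python
# Sketch: compute the valid time intervals in which a new activity could
# be placed on a unit-capacity resource, given existing reservations.

def valid_intervals(busy, horizon):
    """busy: sorted, non-overlapping (start, end) reservations.
    Returns the free gaps within [0, horizon)."""
    free, t = [], 0
    for start, end in busy:
        if start > t:
            free.append((t, start))   # gap before this reservation
        t = max(t, end)
    if t < horizon:
        free.append((t, horizon))     # tail gap up to the horizon
    return free
```

Because each timeline's intervals can be computed independently, this is the function later parallelized in map/reduce fashion.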
tile/mesh processor

[Diagram: Tilera TILE64™ — an 8 x 8 grid of cores; each tile contains a processor, cache, and switch, with RAM attached at the edges of the mesh.]
Tilera TILE64™ memory access

| location / event   | cache size   | penalty (cycles) | line size | MIPS     |
| best               | –            | 0                | –         | 600-1800 |
| branch mis-predict | –            | 2                | –         | 250      |
| L1                 | 8Kb I, 8Kb D | 2                | 16b       | 250      |
| L2                 | 64Kb         | 8                | 64b       | 80       |
| L3                 | 4Mb          | 35-49            | 64b       | 20       |
| RAM                | 4Gb          | 69-88            | –         | 10       |

Ungar and Adams, 2009
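The penalties above are what make data placement the dominant design question. A back-of-the-envelope calculation, using roughly the table's penalty cycles (L3/RAM taken at midpoints; the hit-rate mixes are illustrative assumptions, not measurements):

```python
# Expected cycles of miss penalty per memory access on TILE64, using
# approximate per-level penalties from the table above. Hit-rate mixes
# are assumed for illustration.

PENALTY_CYCLES = {"L1": 2, "L2": 8, "L3": 42, "RAM": 78}  # 42 ~ mid(35-49), 78 ~ mid(69-88)

def expected_penalty(hit_rates):
    """hit_rates: fraction of accesses satisfied at each level (sums to 1)."""
    return sum(frac * PENALTY_CYCLES[level] for level, frac in hit_rates.items())

# A workload that stays in local cache vs. one that spills to L3/RAM:
local  = expected_penalty({"L1": 0.90, "L2": 0.08, "L3": 0.02, "RAM": 0.00})
spilly = expected_penalty({"L1": 0.60, "L2": 0.20, "L3": 0.10, "RAM": 0.10})
```

Under these assumed mixes the spilling workload pays roughly 4-5x the average penalty per access, which motivates the later designs that pin planner data in local cache.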
|| stochastic search, star, copied memory

[Diagram: star topology — the master cpu sends the updated schedule to worker cpus, each searching its own copy of memory; each worker returns its score and best schedule.]

evolutionary search

[Diagram: star topology — the master sends mutated/crossed schedules with updates to worker cpus; each worker returns its score and best schedule.]
memory || by time, master-slave, chain

[Diagram: the schedule database (SDB) is partitioned by time across a chain of worker cpus; the master sends activity Δs and propagation down the chain and receives valid intervals and conflicts back.]
memory || by time, peer-to-peer

[Diagram: the SDB is partitioned by time into segments 0-4 spread across cpus; neighboring peers exchange activity Δs and propagation directly.]
memory || by activity/timeline type, master-slave, star

[Diagram: the SDB is partitioned by activity/timeline type into SDB0-SDB4, one per worker cpu; the master sends activity Δs and reservations and receives valid intervals and conflicts.]
memory || by activity/timeline type, peer-to-peer

[Diagram: the SDB is partitioned into SDB0-SDB4 across peer cpus, which exchange activity Δs and propagation directly; marked with "?" as an open design question.]
Options: distributing memory

• which data
  – search space
  – plans
  – activities
  – timelines
• how to partition for load balance
• data replication

Related work — distributed planning: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP)
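The "how to partition for load balance" choice can be made concrete with a simple greedy split. A sketch of partitioning a timeline's activities by time window so each core gets a roughly equal share (the per-activity cost model is an illustrative assumption):

```python
# Sketch: greedily partition activities by time so each core gets a
# contiguous time window with roughly equal total work.

def partition_by_time(activities, n_cores):
    """activities: list of (start_time, cost) pairs.
    Returns one bucket per core; each bucket is contiguous in time."""
    activities = sorted(activities)                  # order by start time
    total = sum(cost for _, cost in activities)
    target = total / n_cores                         # ideal work per core
    buckets, current, acc = [], [], 0.0
    for act in activities:
        current.append(act)
        acc += act[1]
        # close this bucket once it reaches the target (keep last open)
        if acc >= target and len(buckets) < n_cores - 1:
            buckets.append(current)
            current, acc = [], 0.0
    buckets.append(current)
    return buckets
```

A dynamic variant would re-run this split as activities are added and deleted, which is the "dynamic grouping" strategy named in the design table later in the slides.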
Options: parallelizing functions

• which functions
  – entire algorithm
  – parts of algorithm
    o identifying valid search operations (valid intervals)
    o performing a planning/search operation
    o parameter dependency updates
    o timeline updates
    o identifying flaws
  – methods of data objects (i.e. distributing memory)
  – data structure operations
• symmetry (loop-parallelized, master-slave, distributed)

Related work — distributed planning: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP)
Options: data access and communication

• access location types (processing node, cache, RAM, disk, network/messages)
• allocation control (specify node, specify cache, OS decides)
• movement of data
• maintaining consistency of replicated data (transactions/mutexes, conflict resolution)
• integration of results (transactions/mutexes, conflict resolution)
• data routing (centralized, hierarchical, peer-to-peer)
• synchronous or asynchronous
• communication services (hardware specific, threads, sockets, file I/O, MPI, CORBA, database, distributed planning interfaces)
memory || by activity/timeline type, computation || by conflict type, master-slave, star

[Diagram: the SDB is partitioned by activity/timeline type into SDB1-SDB4 across worker cpus; per-conflict-type Repairers exchange activity Δs, reservation Δs, timeline values, valid intervals, and conflicts with the master.]
|                        | parallelize bottleneck functions | parallelize bottleneck functions | parallelize repair/optimize by flaw type |
| memory distributed     | timelines | dependencies/activities | none |
| load balance strategy  | dynamic grouping | dynamic grouping | none needed |
| functions parallelized | propagation, valid intervals | propagation, conflict gathering | repair, optimize |
| symmetry               | peer-to-peer | peer-to-peer | master-slave, asymmetric by conflict type |
| data location          | local cache, pre-specified | local cache, pre-specified | RAM/cache |
| data movement          | none | none | OS controlled |
| replicated data        | none – shared memory | none – shared memory | none – shared memory |
| integration            | shared memory, no conflicts | shared memory, no conflicts | determine independence |
| data routing           | centralized through cache & RAM | centralized through cache & RAM | centralized through RAM/cache |
| synchronization        | synchronize after propagation | full propagate before conflict gathering | sequential processing of dependent conflicts |
| services               | Pthreads | Pthreads | Pthreads |
| advantages             | may keep nearly all data in local cache | may keep nearly all data in local cache | many flaws may be independently addressed |
| disadvantages          | local cache may not be large enough for both instructions and data, but that may be unavoidable | local cache may not be large enough for both instructions and data, but that may be unavoidable | difficult to take advantage of locally cached data; difficult to load balance and maximize utilization |
results on simple || stochastic search

[Diagram: star topology — the master distributes the updated schedule to worker cpus on cores numbered 1-16 across the mesh; each returns its score and best schedule.]
Spreading out cores to try to get more efficient access to memory

• At first, just assigned cores from left to right and top down ("all at top")
• Then tried a strategy to spread cores as far apart as possible while keeping them as close as possible to the memory controllers ("top and bottom")
summary of experimental results

• Ran multiple instances of CASPER (1 per core) on the same problem but with different random seeds.
  – of 20 problems, about half improved with more cores
  – many slowed down with more cores
  – several improved by only a few tens of percent
  – one improved >60x
  – more than 8 cores typically doesn't provide a dramatic increase in speedup
• Tried spreading cores out to see if memory access would improve
  – run times vary less and are often slightly better
  – but more variance actually leads to bigger speedups!
• 3 main computational bottlenecks in CASPER; parallelizing one (valid intervals) in map/reduce fashion
  – full parallelization introduces too much overhead
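The overhead finding above suggests spawning threads only when the work justifies them. A sketch of that tuning heuristic (the overhead figure and 8-thread cap are illustrative assumptions, the cap echoing the ">8 cores rarely helps" observation):

```python
# Sketch: pick a thread count from the amount of work so small jobs stay
# sequential, since spawn/sync overhead dominated small runs in the
# experiments. The overhead estimate and cap are illustrative.

from concurrent.futures import ThreadPoolExecutor

def choose_threads(n_items, per_item_cost_us, overhead_us=200, max_threads=8):
    """Use more threads only when each thread's share of work comfortably
    exceeds the cost of spawning and synchronizing it."""
    total = n_items * per_item_cost_us
    if total < 2 * overhead_us:
        return 1                              # not worth parallelizing
    return max(1, min(max_threads, total // (2 * overhead_us)))

def parallel_map(func, items, per_item_cost_us):
    n = choose_threads(len(items), per_item_cost_us)
    if n == 1:
        return [func(x) for x in items]       # sequential fast path
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(func, items))
```

This is the shape of the "how many threads to spawn" question in the plan that follows.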
what’s the plan from here?1. bottleneck 1: figure out how many threads to spawn for valid intervals2. see how homing timeline memory affects performance2. bottleneck 2: parallelize parameter constraint network (PCN) similarly
– map: update function source parameters– reduce: apply function to source parameters and assign to sink parameter– home memory for activities (with their parameters)
3. bottleneck 3: conflict gathering – see if parallelizing PCN also helps this, or apply map-reduce approach
4. parallelize conflict repair– for example, move activity to resolve one conflict while switching another
activity’s resource to resolve another conflict– can mutexes around activities, parameters, and timelines allow
unrestricted parallelization without deadlock?– if not, we will need to actively determine when/where repair operations
can run concurrently5. tuning
– as with valid intervals, need to know how many threads to spawn for these different operations; if homing memory, maybe always spawning to home core is ok
– how to balance memory and threads across cores
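One standard answer to the mutex question in item 4 is ordered lock acquisition: if every repair locks the activities, parameters, and timelines it touches in one global order, concurrent repairs cannot deadlock. A sketch (object model is an illustrative assumption):

```python
# Sketch for the deadlock question: acquire the locks of all plan objects
# a repair touches in a single global order (here, ascending object id),
# which rules out circular wait between concurrent repairs.

import threading

class PlanObject:
    """Stand-in for an activity, parameter, or timeline with its mutex."""
    def __init__(self, oid):
        self.oid = oid
        self.lock = threading.Lock()

def with_locks(objects, operation):
    """Lock every touched object in ascending id order, run the repair
    operation, then release in reverse order."""
    ordered = sorted(objects, key=lambda o: o.oid)
    for obj in ordered:
        obj.lock.acquire()
    try:
        return operation()
    finally:
        for obj in reversed(ordered):
            obj.lock.release()
```

Two repairs that name the same objects in opposite orders still acquire them in the same global order, so neither can hold a lock the other needs while waiting.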
Example parameter dependencies:
  energy ← power * duration
  power ← powerForMode(mode)
  duration ← takeImage.duration
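Evaluating this dependency chain is what "evaluate dependency function, propagate values" means in the PCN bottleneck: each sink parameter is recomputed from its sources. A minimal sketch (the mode-to-power table and values are illustrative assumptions):

```python
# Sketch: propagate the example parameter dependencies above. The
# powerForMode table and numeric values are assumed for illustration.

POWER_FOR_MODE = {"idle": 5.0, "imaging": 20.0}   # watts (assumed)

def power_for_mode(mode):
    return POWER_FOR_MODE[mode]

def propagate(params):
    """Recompute sink parameters from sources, in dependency order:
    power <- powerForMode(mode); duration <- takeImage.duration;
    energy <- power * duration."""
    params["power"] = power_for_mode(params["mode"])
    params["duration"] = params["takeImage.duration"]
    params["energy"] = params["power"] * params["duration"]
    return params
```

The map/reduce split in the plan corresponds to updating the source parameters (map) and then applying each dependency function to assign its sink (reduce).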
Summary
• A large number of choices may go into a design for a parallelized planning system.
• Presented a hybrid design for parallelizing a continual iterative repair planning system for a tile/mesh multi-core processor.
• Need to characterize design choices by listing what implementation features each would entail.