Options for Parallelizing CASPER for a Mesh Processor
Brad Clement, Tara Estlin, Ben Bornstein
Jet Propulsion Laboratory, California Institute of Technology
Copyright 2011 California Institute of Technology. Government sponsorship acknowledged.
Outline
• Project context
• CASPER
• Some example designs
• Parallelization design choices
• Our design
Improving Long-Range Rover Science Using Multi-Core Computing
Objectives:
• Develop and demonstrate key capabilities for rover traverse science using multi-core computing
• Adapt three autonomous science technologies to SOA multi-core system
  – rock finder (Rockster)
  – texture analysis
  – continual replanning (CASPER)
• Demonstrate with rover hardware and measure performance benefits using metrics such as execution time and data processed
CASPER – Continuous Activity Scheduling Planning Execution and Replanning
• CASPER uses a model of [spacecraft] activities to construct a [mission] plan to achieve [mission] goals while respecting [spacecraft operations] constraints
  – Example goals: science requests, downlink requests, maneuver requests
  – Example constraints: limited memory, power, propellant
• autonomously commanding EO-1 for the past 7 years
• automated sequence planning/generation for Orbital Express
• DSN resource allocation
• Modified Antarctic Mapping Mission (MAMM)
• 60+ models
[Diagram: CASPER architecture — the Schedule Database (SDB) holds activities, resource & state timelines, and conflicts. CASPER takes in goals; the Repairer applies schedule operations to resolve conflicts; the Commandable issues commands for execution and feeds state updates back into the SDB.]
CASPER Cycle
1. commandable updates timelines for state updates
2. propagate timelines and constraint networks
3. detail activities for new goals
4. check for flaws
5. repair/optimize for flaw
6. send commands to commandable for execution
7. repeat
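The cycle above reads as one continuous loop. A minimal sketch (class and method names are illustrative stand-ins, not CASPER's actual API):

```python
# Illustrative sketch of the CASPER cycle; names are not CASPER's real API.

class Scheduler:
    def __init__(self):
        self.flaws = []

    def apply_state_updates(self):   # 1. commandable updates timelines
        pass

    def propagate(self):             # 2. propagate timelines and constraint networks
        pass

    def detail_new_goals(self):      # 3. detail activities for new goals
        self.flaws.append("unexpanded goal")

    def find_flaws(self):            # 4. check for flaws
        return list(self.flaws)

    def repair(self, flaw):          # 5. repair/optimize for one flaw
        self.flaws.remove(flaw)

    def send_commands(self):         # 6. send commands to commandable
        pass

def run_cycles(n):
    s = Scheduler()
    for _ in range(n):               # 7. repeat (bounded here for the demo)
        s.apply_state_updates()
        s.propagate()
        s.detail_new_goals()
        flaws = s.find_flaws()
        if flaws:
            s.repair(flaws[0])
        s.send_commands()
    return len(s.flaws)
```

Each iteration repairs at most one flaw, which is why the replanner runs continuously rather than to completion.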
CASPER data → functions (bottlenecks)

• plans/schedules → plan and schedule: identify flaws, add/delete search states
• activities → add, delete, constrain, move, detail, abstract
  o parameters, possible values → get, set, choose value
  o parameter dependencies → evaluate dependency function, propagate values, check staleness
  o parameter/state constraints → find valid values, check violations, propagate constraints
  o reservations → apply to state/resource timelines
  o temporal constraints → add, remove
  o valid time intervals/orderings → compute valid time intervals/orderings
• state/resource vars (timelines) → compute valid time intervals, identify conflicts
  o values → compute, propagate, get contributing activities
• constraint rules → identify conflicts
  o conflicts → choose conflict, choose resolution method (e.g. move)
• preference/optimization criteria → compute scores, identify deficiencies
  o scores, deficiencies → choose preference to improve
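"Compute valid time intervals" is the bottleneck function revisited throughout these slides. A minimal sketch of the idea for a unit-capacity resource (the interval representation is an assumption, not CASPER's):

```python
# Sketch: compute the valid time intervals in which a new activity could
# be placed on a unit-capacity resource, given existing reservations.

def valid_intervals(busy, horizon):
    """busy: sorted, non-overlapping (start, end) reservations.
    Returns the free gaps within [0, horizon)."""
    free, t = [], 0
    for start, end in busy:
        if start > t:
            free.append((t, start))   # gap before this reservation
        t = max(t, end)
    if t < horizon:
        free.append((t, horizon))     # tail gap up to the horizon
    return free
```

Because each timeline's intervals can be computed independently, this is the function later parallelized in map/reduce fashion.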
tile/mesh processor

[Diagram: Tilera TILE64™ — an 8 x 8 grid of cores; each tile contains a processor, cache, and switch, with RAM attached at the edges of the mesh.]
Tilera TILE64™ memory access

| location / event   | cache size   | penalty (cycles) | line size | MIPS     |
| best               | –            | 0                | –         | 600-1800 |
| branch mis-predict | –            | 2                | –         | 250      |
| L1                 | 8Kb I, 8Kb D | 2                | 16b       | 250      |
| L2                 | 64Kb         | 8                | 64b       | 80       |
| L3                 | 4Mb          | 35-49            | 64b       | 20       |
| RAM                | 4Gb          | 69-88            | –         | 10       |

Ungar and Adams, 2009
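The penalties above are what make data placement the dominant design question. A back-of-the-envelope calculation, using roughly the table's penalty cycles (L3/RAM taken at midpoints; the hit-rate mixes are illustrative assumptions, not measurements):

```python
# Expected cycles of miss penalty per memory access on TILE64, using
# approximate per-level penalties from the table above. Hit-rate mixes
# are assumed for illustration.

PENALTY_CYCLES = {"L1": 2, "L2": 8, "L3": 42, "RAM": 78}  # 42 ~ mid(35-49), 78 ~ mid(69-88)

def expected_penalty(hit_rates):
    """hit_rates: fraction of accesses satisfied at each level (sums to 1)."""
    return sum(frac * PENALTY_CYCLES[level] for level, frac in hit_rates.items())

# A workload that stays in local cache vs. one that spills to L3/RAM:
local  = expected_penalty({"L1": 0.90, "L2": 0.08, "L3": 0.02, "RAM": 0.00})
spilly = expected_penalty({"L1": 0.60, "L2": 0.20, "L3": 0.10, "RAM": 0.10})
```

Under these assumed mixes the spilling workload pays roughly 4-5x the average penalty per access, which motivates the later designs that pin planner data in local cache.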
|| stochastic search, star, copied memory

[Diagram: star topology — the master cpu sends the updated schedule to worker cpus, each searching its own copy of memory; each worker returns its score and best schedule.]

evolutionary search

[Diagram: star topology — the master sends mutated/crossed schedules with updates to worker cpus; each worker returns its score and best schedule.]
memory || by time, master-slave, chain

[Diagram: the schedule database (SDB) is partitioned by time across a chain of worker cpus; the master sends activity Δs and propagation down the chain and receives valid intervals and conflicts back.]
memory || by time, peer-to-peer

[Diagram: the SDB is partitioned by time into segments 0-4 spread across cpus; neighboring peers exchange activity Δs and propagation directly.]
memory || by activity/timeline type, master-slave, star

[Diagram: the SDB is partitioned by activity/timeline type into SDB0-SDB4, one per worker cpu; the master sends activity Δs and reservations and receives valid intervals and conflicts.]
memory || by activity/timeline type, peer-to-peer

[Diagram: the SDB is partitioned into SDB0-SDB4 across peer cpus, which exchange activity Δs and propagation directly; marked with "?" as an open design question.]
Options: distributing memory

• which data
  – search space
  – plans
  – activities
  – timelines
• how to partition for load balance
• data replication

Related work — distributed planning: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP)
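The "how to partition for load balance" choice can be made concrete with a simple greedy split. A sketch of partitioning a timeline's activities by time window so each core gets a roughly equal share (the per-activity cost model is an illustrative assumption):

```python
# Sketch: greedily partition activities by time so each core gets a
# contiguous time window with roughly equal total work.

def partition_by_time(activities, n_cores):
    """activities: list of (start_time, cost) pairs.
    Returns one bucket per core; each bucket is contiguous in time."""
    activities = sorted(activities)                  # order by start time
    total = sum(cost for _, cost in activities)
    target = total / n_cores                         # ideal work per core
    buckets, current, acc = [], [], 0.0
    for act in activities:
        current.append(act)
        acc += act[1]
        # close this bucket once it reaches the target (keep last open)
        if acc >= target and len(buckets) < n_cores - 1:
            buckets.append(current)
            current, acc = [], 0.0
    buckets.append(current)
    return buckets
```

A dynamic variant would re-run this split as activities are added and deleted, which is the "dynamic grouping" strategy named in the design table later in the slides.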
Options: parallelizing functions

• which functions
  – entire algorithm
  – parts of algorithm
    o identifying valid search operations (valid intervals)
    o performing a planning/search operation
    o parameter dependency updates
    o timeline updates
    o identifying flaws
  – methods of data objects (i.e. distributing memory)
  – data structure operations
• symmetry (loop-parallelized, master-slave, distributed)

Related work — distributed planning: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP)
Options: data access and communication

• access location types (processing node, cache, RAM, disk, network/messages)
• allocation control (specify node, specify cache, OS decides)
• movement of data
• maintaining consistency of replicated data (transactions/mutexes, conflict resolution)
• integration of results (transactions/mutexes, conflict resolution)
• data routing (centralized, hierarchical, peer-to-peer)
• synchronous or asynchronous
• communication services (hardware specific, threads, sockets, file I/O, MPI, CORBA, database, distributed planning interfaces)
memory || by activity/timeline type, computation || by conflict type, master-slave, star

[Diagram: the SDB is partitioned by activity/timeline type into SDB1-SDB4 across worker cpus; per-conflict-type Repairers exchange activity Δs, reservation Δs, timeline values, valid intervals, and conflicts with the master.]
|                        | parallelize bottleneck functions | parallelize bottleneck functions | parallelize repair/optimize by flaw type |
| memory distributed     | timelines | dependencies/activities | none |
| load balance strategy  | dynamic grouping | dynamic grouping | none needed |
| functions parallelized | propagation, valid intervals | propagation, conflict gathering | repair, optimize |
| symmetry               | peer-to-peer | peer-to-peer | master-slave, asymmetric by conflict type |
| data location          | local cache, pre-specified | local cache, pre-specified | RAM/cache |
| data movement          | none | none | OS controlled |
| replicated data        | none – shared memory | none – shared memory | none – shared memory |
| integration            | shared memory, no conflicts | shared memory, no conflicts | determine independence |
| data routing           | centralized through cache & RAM | centralized through cache & RAM | centralized through RAM/cache |
| synchronization        | synchronize after propagation | full propagate before conflict gathering | sequential processing of dependent conflicts |
| services               | Pthreads | Pthreads | Pthreads |
| advantages             | may keep nearly all data in local cache | may keep nearly all data in local cache | many flaws may be independently addressed |
| disadvantages          | local cache may not be large enough for both instructions and data, but that may be unavoidable | local cache may not be large enough for both instructions and data, but that may be unavoidable | difficult to take advantage of locally cached data; difficult to load balance and maximize utilization |
results on simple || stochastic search

[Diagram: star topology — the master distributes the updated schedule to worker cpus on cores numbered 1-16 across the mesh; each returns its score and best schedule.]
Spreading out cores to try to get more efficient access to memory

• At first, just assigned cores from left to right and top down ("all at top")
• Then tried a strategy to spread cores as far apart as possible while keeping them as close as possible to the memory controllers ("top and bottom")
summary of experimental results

• Ran multiple instances of CASPER (1 per core) on the same problem but with different random seeds.
  – of 20 problems, about half improved with more cores
  – many slowed down with more cores
  – several improved by only a few tens of percent
  – one improved >60x
  – more than 8 cores typically doesn't provide a dramatic increase in speedup
• Tried spreading cores out to see if memory access would improve
  – run times vary less and are often slightly better
  – but more variance actually leads to bigger speedups!
• 3 main computational bottlenecks in CASPER; parallelizing one (valid intervals) in map/reduce fashion
  – full parallelization introduces too much overhead
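The overhead finding above suggests spawning threads only when the work justifies them. A sketch of that tuning heuristic (the overhead figure and 8-thread cap are illustrative assumptions, the cap echoing the ">8 cores rarely helps" observation):

```python
# Sketch: pick a thread count from the amount of work so small jobs stay
# sequential, since spawn/sync overhead dominated small runs in the
# experiments. The overhead estimate and cap are illustrative.

from concurrent.futures import ThreadPoolExecutor

def choose_threads(n_items, per_item_cost_us, overhead_us=200, max_threads=8):
    """Use more threads only when each thread's share of work comfortably
    exceeds the cost of spawning and synchronizing it."""
    total = n_items * per_item_cost_us
    if total < 2 * overhead_us:
        return 1                              # not worth parallelizing
    return max(1, min(max_threads, total // (2 * overhead_us)))

def parallel_map(func, items, per_item_cost_us):
    n = choose_threads(len(items), per_item_cost_us)
    if n == 1:
        return [func(x) for x in items]       # sequential fast path
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(func, items))
```

This is the shape of the "how many threads to spawn" question in the plan that follows.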
what’s the plan from here?1. bottleneck 1: figure out how many threads to spawn for valid intervals2. see how homing timeline memory affects performance2. bottleneck 2: parallelize parameter constraint network (PCN) similarly
– map: update function source parameters– reduce: apply function to source parameters and assign to sink parameter– home memory for activities (with their parameters)
3. bottleneck 3: conflict gathering – see if parallelizing PCN also helps this, or apply map-reduce approach
4. parallelize conflict repair– for example, move activity to resolve one conflict while switching another
activity’s resource to resolve another conflict– can mutexes around activities, parameters, and timelines allow
unrestricted parallelization without deadlock?– if not, we will need to actively determine when/where repair operations
can run concurrently5. tuning
– as with valid intervals, need to know how many threads to spawn for these different operations; if homing memory, maybe always spawning to home core is ok
– how to balance memory and threads across cores
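One standard answer to the mutex question in item 4 is ordered lock acquisition: if every repair locks the activities, parameters, and timelines it touches in one global order, concurrent repairs cannot deadlock. A sketch (object model is an illustrative assumption):

```python
# Sketch for the deadlock question: acquire the locks of all plan objects
# a repair touches in a single global order (here, ascending object id),
# which rules out circular wait between concurrent repairs.

import threading

class PlanObject:
    """Stand-in for an activity, parameter, or timeline with its mutex."""
    def __init__(self, oid):
        self.oid = oid
        self.lock = threading.Lock()

def with_locks(objects, operation):
    """Lock every touched object in ascending id order, run the repair
    operation, then release in reverse order."""
    ordered = sorted(objects, key=lambda o: o.oid)
    for obj in ordered:
        obj.lock.acquire()
    try:
        return operation()
    finally:
        for obj in reversed(ordered):
            obj.lock.release()
```

Two repairs that name the same objects in opposite orders still acquire them in the same global order, so neither can hold a lock the other needs while waiting.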
Example parameter dependencies:
  energy ← power * duration
  power ← powerForMode(mode)
  duration ← takeImage.duration
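Evaluating this dependency chain is what "evaluate dependency function, propagate values" means in the PCN bottleneck: each sink parameter is recomputed from its sources. A minimal sketch (the mode-to-power table and values are illustrative assumptions):

```python
# Sketch: propagate the example parameter dependencies above. The
# powerForMode table and numeric values are assumed for illustration.

POWER_FOR_MODE = {"idle": 5.0, "imaging": 20.0}   # watts (assumed)

def power_for_mode(mode):
    return POWER_FOR_MODE[mode]

def propagate(params):
    """Recompute sink parameters from sources, in dependency order:
    power <- powerForMode(mode); duration <- takeImage.duration;
    energy <- power * duration."""
    params["power"] = power_for_mode(params["mode"])
    params["duration"] = params["takeImage.duration"]
    params["energy"] = params["power"] * params["duration"]
    return params
```

The map/reduce split in the plan corresponds to updating the source parameters (map) and then applying each dependency function to assign its sink (reduce).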
Summary
• A large number of choices may go into a design for a parallelized planning system.
• Presented a hybrid design for parallelizing a continual iterative repair planning system for a tile/mesh multi-core processor.
• Need to characterize design choices by listing what implementation features each would entail.