Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team,
especially Yujia Jin, Kaushik Ravindran, and N. R. Satish
04/18/23 2
Design Space Exploration Flow
[Figure: design space exploration flow. Labels: application description (a packet-forwarding element graph with FromDevice(0) through FromDevice(15), IPVerify, DecIPTTL, LookupIPRoute, Discard, and ToDevice(0) through ToDevice(15)); task graph + profiles (nodes S1, R1, L1, T1, R2, L2, T2, S2); multiprocessor platform (PEs, co-PEs, memories, and a peripheral, drawn from MicroBlaze (soft) processors, FSL, OPB, PLB, hardware acceleration, Ethernet, on-chip BRAM, and off-chip SDRAM); platform constraints; scheduling constraints; allocation/scheduling; HW/SW generation; implementation; performance analysis; performance numbers.]
Investigative Approach
Demonstrate network applications on FPGA-based soft multiprocessors
Design choices for the target application:
– Number of processors
– Interconnection network
– Memory hierarchy
– Custom co-processors
Cost reduction by avoiding custom silicon
Productivity gains due to software abstraction
[Figure: architecture building blocks on the Xilinx Virtex-II Pro and Virtex-IV families of FPGAs.
Processing element: MicroBlaze (soft), PowerPC (hard)
Co-processor (hardware acceleration): hash engine, crypto engine
Memory: BRAM (on-chip), SDRAM (off-chip)
Bus and queue: FSL, OPB, PLB
Peripheral: Ethernet
An example multiprocessor configuration combines PEs, co-PEs, memories, and a peripheral.]
Obstacles to Their Adoption: Hard to design
Complex micro-architecture design space
– Processor choices
– Memory hierarchy
– Communication topology
Difficult mapping decisions
– Assigning computation to processing elements
– Assigning data to exposed heterogeneous memories
To unlock the potential of these systems, tools enabling efficiency and …
Heuristic methods
– List scheduling, force-directed scheduling
Exact methods
– Enumeration and tabu search, branch-and-bound
Limitations of these approaches
– Specific implementation constraints are hard to enforce
– Most approaches require per-instance tuning and are hard to generalize, and are therefore poor for design space exploration
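To make the heuristic side concrete, here is a minimal sketch of list scheduling in Python. It is an illustration under stated assumptions, not the scheduler used in this work: PEs are identical, communication costs are ignored, and the priority rule (longest remaining path to a sink) and the four-task example graph are hypothetical choices.

```python
import heapq


def list_schedule(durations, deps, num_pes):
    """Greedy list scheduling on identical PEs (communication ignored).

    durations: {task: execution time}
    deps:      {task: set of predecessor tasks}
    Returns ({task: (pe, start, finish)}, makespan).
    """
    # Build successor lists and count unscheduled predecessors per task.
    succs = {t: [] for t in durations}
    pending = {t: len(deps.get(t, ())) for t in durations}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)

    # Priority (assumed rule): longest path from the task to any sink.
    rank = {}

    def bottom_level(t):
        if t not in rank:
            rank[t] = durations[t] + max(
                (bottom_level(s) for s in succs[t]), default=0)
        return rank[t]

    ready = [(-bottom_level(t), t) for t in durations if pending[t] == 0]
    heapq.heapify(ready)
    pe_free = [0] * num_pes          # time at which each PE becomes idle
    schedule = {}
    while ready:
        _, t = heapq.heappop(ready)  # highest-priority ready task
        data_ready = max((schedule[p][2] for p in deps.get(t, ())), default=0)
        # Pick the PE that lets the task start earliest.
        pe = min(range(num_pes), key=lambda i: max(pe_free[i], data_ready))
        start = max(pe_free[pe], data_ready)
        finish = start + durations[t]
        schedule[t] = (pe, start, finish)
        pe_free[pe] = finish
        for s in succs[t]:
            pending[s] -= 1
            if pending[s] == 0:
                heapq.heappush(ready, (-bottom_level(s), s))
    return schedule, max(f for _, _, f in schedule.values())


# Hypothetical four-task graph, loosely named after forwarding steps.
durations = {"verify": 2, "lookup": 4, "ttl": 1, "send": 1}
deps = {"lookup": {"verify"}, "ttl": {"verify"}, "send": {"lookup", "ttl"}}
schedule, makespan = list_schedule(durations, deps, num_pes=2)
print(makespan, schedule)
```

The priority rule is the part that typically needs per-instance tuning, which is the limitation noted above.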
Constraint Optimization Techniques for Automated Exploration
Advantages
– Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
– Implementation constraints specific to a problem can be incorporated easily
– Constraint solvers can exhaustively cover a search space without enumerating all solutions
Key strategies to improve solver performance:
– Decomposition methods
– Variable ordering
– Improved lower and upper bounds
– Symmetry representation
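Three of these strategies (variable ordering, lower and upper bounds, and symmetry) can be illustrated with a toy branch-and-bound in Python. This is a sketch under simplifying assumptions, not the constraint solver used in this work: tasks are independent, PEs are identical, and the function name and example durations are hypothetical.

```python
def min_makespan(durations, num_pes):
    """Exact minimum makespan for independent tasks on identical PEs."""
    # Variable ordering: decide the largest tasks first.
    tasks = sorted(durations, reverse=True)

    # Upper bound: makespan of a greedy largest-first assignment.
    greedy = [0] * num_pes
    for d in tasks:
        greedy[greedy.index(min(greedy))] += d
    best = max(greedy)

    # Lower bound: no schedule can beat the average load per PE.
    avg_bound = sum(tasks) / num_pes

    loads = [0] * num_pes

    def search(i, current_max):
        nonlocal best
        if i == len(tasks):
            best = min(best, current_max)
            return
        tried = set()
        for pe in range(num_pes):
            # Symmetry: PEs with equal load are interchangeable here,
            # so only the first of them is explored.
            if loads[pe] in tried:
                continue
            tried.add(loads[pe])
            new_max = max(current_max, loads[pe] + tasks[i])
            # Bounding: prune branches that cannot improve on `best`.
            if max(new_max, avg_bound) >= best:
                continue
            loads[pe] += tasks[i]
            search(i + 1, new_max)
            loads[pe] -= tasks[i]

    search(0, 0)
    return best


# Hypothetical instance where the greedy bound is not optimal.
result = min_makespan([3, 3, 2, 2, 2], num_pes=2)
print(result)
```

The search still covers the whole space in the sense that every pruned branch is provably no better than the incumbent, which is how a solver can be exhaustive without enumerating all solutions.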
ILP Formulation
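As a rough illustration of what an ILP for task allocation and scheduling can look like, the block below gives a generic textbook-style sketch; it is not the formulation from this slide. All symbols are assumptions: tasks $t \in T$ with execution times $w_t$ (identical on every processor), precedence edges $E$, processors $p \in P$, binary $x_{t,p}$ (task $t$ runs on $p$), start times $s_t$, binary $y_{u,v}$ (task $u$ runs before $v$ when they share a processor), makespan $M$, and a sufficiently large constant $B$. Communication delays are ignored.

```latex
\begin{align*}
\min\ & M \\
\text{s.t. } & \sum_{p \in P} x_{t,p} = 1 && \forall t \in T \\
& s_v \ge s_u + w_u && \forall (u,v) \in E \\
& s_v \ge s_u + w_u - B\,(3 - x_{u,p} - x_{v,p} - y_{u,v}) && \forall u < v,\ p \in P \\
& s_u \ge s_v + w_v - B\,(2 - x_{u,p} - x_{v,p} + y_{u,v}) && \forall u < v,\ p \in P \\
& M \ge s_t + w_t && \forall t \in T \\
& x_{t,p},\ y_{u,v} \in \{0,1\}, \quad s_t \ge 0 && \forall t, u, v \in T,\ p \in P
\end{align*}
```

The two big-$B$ rows only become active when both tasks are placed on the same processor, in which case exactly one of them forces the two executions not to overlap.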
Example Application: IPv4 Packet Forwarding
Data plane of IPv4 packet forwarding (RFC 1812)
– Campus network router, home router
– Medium-sized route table (5,000 entries or fewer)
– Route table small enough to fit in on-chip memory
Target platform
– Xilinx Virtex-II Pro 2VP50 FPGA
Lookup: inspect the destination address and find the next hop
– Longest prefix match
– Implementation determined by route distribution, memory, and performance constraints
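A minimal Python sketch of longest prefix match follows. It is only meant to show the lookup semantics; the data structure (one hash table per prefix length), the class name, and the example routes are assumptions, and a memory-constrained FPGA implementation would likely use a different structure.

```python
import ipaddress


class RouteTable:
    """Toy IPv4 route table with longest-prefix-match lookup."""

    def __init__(self):
        # prefix length -> {network address as int: next hop}
        self._by_len = {}

    def add(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        self._by_len.setdefault(net.prefixlen, {})[
            int(net.network_address)] = next_hop

    def lookup(self, address):
        addr = int(ipaddress.ip_address(address))
        # Try the longest prefixes first; the first hit is the best match.
        for plen in sorted(self._by_len, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hop = self._by_len[plen].get(addr & mask)
            if hop is not None:
                return hop
        return None


# Hypothetical routes and next-hop names.
table = RouteTable()
table.add("0.0.0.0/0", "default")
table.add("10.0.0.0/8", "port-A")
table.add("10.1.0.0/16", "port-B")
print(table.lookup("10.1.2.3"))
```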
Hand-tuned Multiprocessor Design for IPv4 Forwarding
Achieved 1.8 Gbps throughput for header processing using 12 MicroBlaze processors
[Figure: hand-tuned design. Labels: four "Verify ver & ttl, checksum, Lookup1" stages; four "Lookup2" stages; two route tables; "From source" and "To source" links to MicroBlaze 1 and MicroBlaze 2. Key: MicroBlaze, block RAM, bus, queue.]
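The header-processing work in the "Verify ver & ttl, checksum" stage can be sketched in Python as follows. This is an illustration of the checks, not the MicroBlaze code: the function names are hypothetical, the discard rule (drop when TTL is 1 or less) is an assumption, and the checksum is the standard 16-bit one's-complement Internet checksum.

```python
def checksum16(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (even length)."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify_and_decrement(header: bytes):
    """Return the updated header, or None if the packet is discarded."""
    if len(header) < 20 or header[0] >> 4 != 4:      # version must be 4
        return None
    ihl = (header[0] & 0x0F) * 4                     # header length in bytes
    if ihl < 20 or len(header) < ihl:
        return None
    if checksum16(header[:ihl]) != 0:                # valid headers sum to 0
        return None
    if header[8] <= 1:                               # TTL would expire
        return None
    out = bytearray(header[:ihl])
    out[8] -= 1                                      # decrement TTL
    out[10:12] = b"\x00\x00"                         # clear, then recompute
    out[10:12] = checksum16(bytes(out)).to_bytes(2, "big")
    return bytes(out)


# A 20-byte example header with TTL 0x40 and a valid checksum.
header = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
forwarded = verify_and_decrement(header)
print(forwarded.hex())
```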
Improved Design after Automated Exploration
Resulting design achieved 2.0 Gbps throughput
– Surpassing the performance of the 1.8 Gbps hand-tuned design
– Using one fewer MicroBlaze processor
The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
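A quick back-of-the-envelope comparison of the two designs, using only the figures quoted on these slides (1.8 Gbps on 12 MicroBlazes versus 2.0 Gbps on one fewer): per processor, that is roughly 150 Mbps versus 182 Mbps, about 21% more. The helper name below is hypothetical.

```python
def per_processor_mbps(total_gbps, processors):
    """Average throughput carried by each processor, in Mbps."""
    return total_gbps * 1000 / processors


hand_tuned = per_processor_mbps(1.8, 12)
explored = per_processor_mbps(2.0, 12 - 1)
gain = explored / hand_tuned

print(f"hand-tuned: {hand_tuned:.1f} Mbps per MicroBlaze")
print(f"explored:   {explored:.1f} Mbps per MicroBlaze")
print(f"ratio:      {gain:.2f}x")
```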
Extend to more complex applications: 1000’s to 10,000’s of tasks
Extend to bigger multiprocessor systems: 100’s to 1000’s of PEs
What can we do for RAMP?
Challenges in deploying concurrent applications on a RAMP system
– Task allocation and scheduling across 100’s to 1000’s of PEs
– Fast mapping step to enable efficient design space exploration
Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
– A “compile-time” tool to guide the designer to explore efficient mappings
– Flexible formulation to target diverse multiprocessors
Research is in progress to extend our techniques to work on problems at the scale of RAMP systems
Backup Slides
Example
Optimal design found in less than 6 seconds on a 400 MHz Sparc II
[Figure: exploration example. Architecture: MicroBlazes, PowerPC, BRAMs. Communication: FSLs, bus. Device: 2VP50. An application and an architecture are explored to produce an optimal design; nodes are labeled P11, P21, P12, and M1.]
Following Moore’s Law
Extend to more complex applications: 1000’s to 10,000’s of tasks
– DSLAM
Extend to bigger multiprocessor systems: 100’s to 1000’s of PEs
– RAMP
Challenges in Automated Exploration
Higher exploration complexity
– Increases by 2 orders of magnitude
More emphasis on communication
– Arbitration modeling
– Routing constraints due to network topology
Statistical cost model for dynamic behavior
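To show how network topology can enter a mapping cost, here is a small Python sketch that scores a placement by traffic volume times hop count. Everything specific in it is an assumption made for illustration: the 2D mesh, minimal (Manhattan-distance) routing, the row-major PE numbering, and the example tasks and volumes. It does not model arbitration or dynamic behavior.

```python
def mesh_hops(a, b, width):
    """Hop count between PEs a and b on a 2D mesh, row-major numbering."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def communication_cost(edges, placement, width):
    """Sum of (traffic volume x hops) over all task-graph edges.

    edges:     iterable of (producer, consumer, volume)
    placement: {task: PE index}
    """
    return sum(vol * mesh_hops(placement[u], placement[v], width)
               for u, v, vol in edges)


# Hypothetical three-task chain placed on a 4-wide mesh.
edges = [("a", "b", 10), ("b", "c", 2)]
placement = {"a": 0, "b": 5, "c": 3}
cost = communication_cost(edges, placement, width=4)
print(cost)
```

A term like this can be added to an allocation objective or turned into a constraint, for example by bounding the hops allowed on high-volume edges.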
Potential Approaches to Address these Challenges
Additional constraints can be easily added to incorporate new features
Constraint solver performance will slow down and thus become the bottleneck
Some strategies to improve constraint solver performance
– Task graph based structural decompositions
– Relaxation heuristics
– Symmetry representation
– Cutting planes and valid inequalities
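As one simple example of a relaxation, the Python sketch below computes a lower bound on the makespan by dropping constraints: ignoring the processor limit gives the critical-path length, and ignoring precedence gives the average load per PE. It assumes identical PEs and no communication cost; the function name and example graph are hypothetical, and this is not necessarily the relaxation used in this work.

```python
def makespan_lower_bound(durations, deps, num_pes):
    """Lower bound on makespan for a task graph on identical PEs.

    durations: {task: execution time}
    deps:      {task: set of predecessor tasks}
    """
    memo = {}

    def earliest_finish(t):
        # Finish time of t if processors were unlimited.
        if t not in memo:
            memo[t] = durations[t] + max(
                (earliest_finish(p) for p in deps.get(t, ())), default=0)
        return memo[t]

    critical_path = max(earliest_finish(t) for t in durations)
    work_bound = sum(durations.values()) / num_pes
    return max(critical_path, work_bound)


# Hypothetical four-task graph.
durations = {"verify": 2, "lookup": 4, "ttl": 1, "send": 1}
deps = {"lookup": {"verify"}, "ttl": {"verify"}, "send": {"lookup", "ttl"}}
bound = makespan_lower_bound(durations, deps, num_pes=2)
print(bound)
```

A solver can stop as soon as it finds a schedule whose makespan equals such a bound, and can prune any partial assignment whose bound already exceeds the best schedule found so far.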