Particle Swarm Optimization for Run-Time Task Decomposition and Scheduling in Evolvable MPSoC
Shervin Vakili, Sied Mehdi Fakhraie, Siamak Mohammadi, and Ali Ahmadi
International Conference on Computer Engineering and Technology 2009 (ICCET 2009)
January 24, 2009
Outline
► Why MPSoC?
► Introduction to EvoMP
► Processing Platform Hardware Architecture
► PSO Hardware Core
► Simulation and Synthesis Results
Why MPSoC?
► An emerging trend for designing high-performance computing architectures.
► Retains most of the desirable advantages of single-processor solutions, such as short time-to-market, post-fabrication reusability, flexibility, and programmability.
► The trend is moving toward a large number of simple processors on a chip.
MPSoC Development Challenges
► Programming models: MP systems require concurrent software. Two main solutions:
  - Software development using parallel models, e.g., OpenMP and MPI
    ► "Software developers have been well-trained by sixty years of computing history to think in terms of sequentially defined applications code" [2]
    ► Requires a huge investment to re-develop existing software
MPSoC Development Challenges (2)
  - Automatic parallelization at compile time
    ► Does not require reprogramming, but does require re-compilation
    ► Such a compiler must solve two complex problems:
      - Decomposition of the program into tasks
      - Scheduling the tasks among cooperating processors
► Both task decomposition and scheduling are NP-complete problems
► G. Martin [2]: "Decomposition of an application described in a serial fashion into a set of concurrent or parallel tasks that can cooperate in an orderly and predictable way is one of the most difficult jobs imaginable and despite of forty or more years of intensive research in this area there are very few applications for which this can be done automatically."
MPSoC Development Challenges (3)
► All MPSoCs can be divided into two categories:
  - Static scheduling
    ► Task scheduling is performed before run time
    ► The number of contributing processors must be predetermined
  - Dynamic scheduling (e.g., current multi-core PC processors)
    ► A run-time scheduler (in hardware, middleware, or the OS) is in charge of task scheduling
    ► Does not require prior information about the number of available processors (desirable for fault tolerance)
Introduction to EvoMP
► An NoC-based homogeneous multiprocessor system with evolvable task decomposition and scheduling
► Features:
  - Distributed control and computing
  - Scalable
  - Does not need parallel programming
    ► Parallel programming is one of the main difficulties in parallel processing
    ► It requires reprogramming all the developed (sequential) software
Introduction to EvoMP (2)
► Features:
  - All computational units have a copy of the entire program
  - A hardware PSO core in the EvoMP architecture generates a bit-string
    ► Specifies the processor in which each instruction must be executed
  - The first version of EvoMP used a genetic algorithm core [8]
Introduction to EvoMP (3)
► Target applications: applications that perform a fixed computation on a stream of data, e.g.:
  - Digital signal processing of video and audio signals
  - Different codec standards
  - Huge sensory-data processing
  - Packet processing in network applications
  - …
EvoMP Top View
► The PSO core produces a bit-string (particle) that determines the location of execution of each instruction, at the beginning of each iteration.
[Figure: 2x2 mesh of processing cells (Cell-00, Cell-01, Cell-10, Cell-11) connected through switches (SW00, SW01, SW10, SW11), with the PSO core attached to the NoC.]
Each processor holds an identical copy of the program (shown once below; the figure repeats it in all four cells):

1- MOV R1, 0
2- MOV R2, 0
L1:              ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- JUMP L1

Particle: 01101010…11
How Chromosome Codes the Scheduling Data
► Streaming applications have two main parts:
  - Initialization
  - An infinite (or semi-infinite) loop

;Initial
1- MOV R1, 0
2- MOV R2, 0
L1:              ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- PSO
10- JUMP L1
How EvoMP Works
► The following process is repeated for each iteration:
  - At the beginning of each iteration, the PSO core generates the bit-string (particle) and sends it to all processors.
  - The processors then execute this iteration of the program with the decomposition and scheduling scheme specified by the bit-string.
  - An internal counter in the PSO core counts the number of clock cycles spent executing the iteration.
  - When all processors reach the end of the loop, the PSO core uses the counter output as the fitness value of the last generated particle.
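The evaluate-each-particle-once-per-loop-iteration scheme above can be sketched in Python. This is a toy model, not the hardware: `run_iteration` stands in for the PSO core's cycle counter (here a made-up cost that rewards balanced schedules), and all names (`run_iteration`, `evolve`) are illustrative rather than taken from the paper. The sketch uses random particles, as in the Initialize state.

```python
import random

def run_iteration(particle, n_instructions=9):
    """Stand-in for hardware execution: derives a deterministic
    'cycle count' from a scheduling bit-string. In EvoMP this number
    comes from a counter inside the PSO core. The cost model here
    (balanced bit-strings finish sooner) is purely illustrative."""
    ones = particle.count("1")
    imbalance = abs(ones - len(particle) // 2)
    return 10 * n_instructions + 5 * imbalance

def evolve(num_particles=16, particle_len=12, iterations=50, seed=0):
    """One particle is broadcast per loop iteration of the application;
    the elapsed clock cycles of that iteration are its fitness."""
    rng = random.Random(seed)
    best_particle, best_fitness = None, float("inf")
    for _ in range(iterations):
        # PSO core generates and broadcasts a particle ...
        particle = "".join(rng.choice("01") for _ in range(particle_len))
        # ... all processors execute; elapsed cycles become the fitness.
        fitness = run_iteration(particle)
        if fitness < best_fitness:
            best_particle, best_fitness = particle, fitness
    return best_particle, best_fitness
```

Because fitness is simply the measured execution time of the real workload, the system needs no analytical model of the program to optimize its schedule.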
How EvoMP Works (2)
► The system has three main states:
  - Initialize:
    ► Only for the first population
    ► The PSO core generates random particles
  - Evolution:
    ► The PSO core produces the new population through particular computations using the best previously archived particles
    ► When the termination condition is met, the system goes to the final state
  - Final:
    ► The best particle achieved in the evolution stage is used as the constant output of the PSO core
    ► When one of the processors becomes faulty, the system returns to the evolution stage to re-evolve (beneficial for the fault-tolerance capability of EvoMP)
[State diagram: Initialize → Evolution; Evolution → Final on "Terminate"; Final → Evolution on "Fault detected"]
How Chromosome Codes the Scheduling Data (1)
► Each bit-string (particle) consists of small words (sub-particles)
► Each sub-particle contains two fields:
  - A processor number
  - A count specifying how many instructions must be executed in the processor named by the first field
How Chromosome Codes the Scheduling Data (2)
► Assume a 2x2 mesh (processor addresses 00, 01, 10, 11)
► Each sub-particle is a [processor address | # of instructions] pair, e.g. "10 001" assigns the next instruction of the loop body to processor 10.
[Figure: a particle split into sub-particles (e.g. "10 001", "00 010"), each mapping a run of consecutive instructions of the listing below to one of the four processors.]

;Initial
1- MOV R1, 0
2- MOV R2, 0
L1:              ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- GENETIC
10- JUMP L1
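The sub-particle decoding described above can be sketched as a small Python function. The field widths are assumptions for a 2x2 mesh (2 processor-address bits; a 3-bit instruction count as suggested by the "10 001" example); the slides do not fix the count-field width, so treat these as illustrative.

```python
def decode_particle(bits, proc_bits=2, count_bits=3):
    """Split a particle bit-string into sub-particles of
    (processor address, number of instructions). Field widths are
    hypothetical; any trailing bits shorter than one word are ignored."""
    word = proc_bits + count_bits
    schedule = []
    for i in range(0, len(bits) - len(bits) % word, word):
        sub = bits[i:i + word]
        proc = int(sub[:proc_bits], 2)       # which processor
        count = int(sub[proc_bits:], 2)      # how many instructions it runs
        schedule.append((proc, count))
    return schedule
```

For example, `decode_particle("1000100010")` yields `[(2, 1), (0, 2)]`: one instruction on processor 10, then two on processor 00.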
Inter-Processor Data Dependencies
► Inter-processor data dependencies are detected in the source processor using architectural mechanisms
  - The source processor transmits the required data to the destination processor(s) through the NoC
  - No request-send scheme is required
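The send-without-request scheme above can be modeled in a few lines of Python. This is an abstract sketch of the idea, not the hardware mechanism: the data structures (`schedule`, `deps`) and function name are hypothetical.

```python
def push_dependencies(schedule, deps, my_proc):
    """Model of source-side dependency detection: for each instruction
    this processor executes, find consumer instructions scheduled on
    OTHER processors and emit (producer_instr, dest_proc) messages,
    without waiting for any request from the destination.
    schedule: {instr -> processor}, deps: {instr -> [consumer instrs]}."""
    messages = []
    for instr, proc in schedule.items():
        if proc != my_proc:
            continue  # not executed here, nothing to forward
        for consumer in deps.get(instr, []):
            dest = schedule[consumer]
            if dest != my_proc:
                # consumer lives elsewhere: push the result over the NoC
                messages.append((instr, dest))
    return messages
```

Because the producer already knows the particle (every processor receives the same bit-string), it can compute the consumers' locations locally, which is what makes the request-free push possible.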
Architecture of each Processor
► The number of FUs is a configurable parameter
► Supports out-of-order execution
► The first free FU grabs the instruction from the Instr bus and sends a signal to Fetch_Issue to fetch the next instruction
Particle Swarm Optimization
► A stochastic population-based evolutionary algorithm
► Tries to find the optimum solution over the search space by sampling points and converging the swarm on the most promising regions
  - The number of sampling points (called particles) is constant (the population size)
  - Each sampling point is a candidate solution
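Since EvoMP particles are bit-strings, the relevant variant is binary (discrete) PSO. Below is a minimal sketch of one Kennedy-Eberhart-style binary PSO update, where each bit keeps a real-valued velocity and the sigmoid of that velocity gives the probability of the bit being 1. The coefficients (`w`, `c1`, `c2`) are conventional textbook defaults, not the parameters of the paper's hardware core.

```python
import math
import random

def binary_pso_step(position, velocity, pbest, gbest,
                    w=0.8, c1=2.0, c2=2.0, rng=random):
    """One binary-PSO update for a single particle.
    position/pbest/gbest: lists of 0/1 bits; velocity: list of floats.
    Returns the new (position, velocity)."""
    new_x, new_v = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        # Velocity is pulled toward the personal and global bests.
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        new_v.append(v)
        # sigmoid(v) is the probability that the new bit is 1.
        prob = 1.0 / (1.0 + math.exp(-v))
        new_x.append(1 if rng.random() < prob else 0)
    return new_x, new_v
```

In EvoMP these updates are performed by the hardware PSO core (see [6] for the asynchronous discrete PSO implementation it builds on), with the cycle-count fitness steering `pbest`/`gbest`.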
► Execution results of a 16-point Discrete Cosine Transform on different-size EvoMPs
  - 324 instructions
  - 128 multiplications
► Parameters:
  - Population size = 16
  - NoC connection width = 16
► Best fitness is the number of clock cycles required to execute one iteration using the best particle found so far.
► Execution results of 5x5 matrix multiplication on different-size EvoMPs
[Figure: MAT-5x5 — best fitness (clock cycles, 0 to 3500) versus time (µs, 0 to 300,000) for 1, 2, 3, and 5 processors]
► Parameters:
  - Population size = 16
  - NoC connection width = 16
Final Evolution Phase Results
► The following table shows the final results achieved in the evolution phase (and the corresponding evolution time) for both the genetic-based and the PSO-based EvoMPs.
► These results show a small improvement in final results and convergence time for the PSO-based system.
► The following table shows the synthesis results of both the PSO and genetic cores on a Virtex-II (XC2V3000) FPGA.
References
[1] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips, San Francisco: Morgan Kaufmann Publishers, 2005.
[2] G. Martin, "Overview of the MPSoC design challenge," Proc. Design and Automation Conf., July 2005, pp. 274-279.
[3] M. Hubner, K. Paulsson, and J. Becker, "Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores," Proc. Int. Symp. Parallel and Distributed Processing, 2005, pp. 149.1.
[4] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, "Runtime adaptive multiprocessor system-on-chip: RAMPSoC," Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008, pp. 1-7.
[5] A. Klimm, L. Braun, and J. Becker, "An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores," Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[6] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, "Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization," Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.
[7] A. J. Page and T. J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," Proc. Int. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1.
[8] S. Vakili, S. M. Fakhraie, and S. Mohammadi, "EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling," to appear in IET Comp. & Digital Tech.
[9] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs," Proc. Intl. Rapid System Prototyping Workshop, 2007, pp. 34-40.
Fetch_Issue Unit
► The PC1-Instr bus is used for executive instructions
► The PC2-Invalidate_Instr bus is used for data-dependency detection
[Figure: Fetch_Issue unit datapath — instruction memory addressed by PC1 and PC2; Scheduling Data FIFO and Scheduled FIFO; a down counter loaded with Instr_num; comparators matching Proc_Addr and Destination against this processor's address; outputs Instr, Invalidate_Instr, and Destination Processor Address.]
Example: 2nd-Order FIR Filter
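The streaming loop in the slide's assembly listing computes a 2-tap filter on the input stream. A paraphrased (not literal) Python rendering of that computation, y[n] = Coe1*x[n] + Coe2*x[n-1], is:

```python
def fir2(samples, coe1, coe2):
    """2-tap FIR on a sample stream: y[n] = coe1*x[n] + coe2*x[n-1].
    prev plays the role of R2 in the listing (initialized to 0);
    each loop iteration reads one input and writes one output."""
    prev = 0
    out = []
    for x in samples:
        out.append(coe1 * x + coe2 * prev)
        prev = x
    return out
```

Each iteration of this loop is one fitness-evaluation window in EvoMP: the particle decides which processor runs each of the loop-body instructions, and the cycle count of the iteration scores the particle.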