
Multithreaded Processors

Jan 16, 2016

Transcript
Page 1: Multithreaded Processors

Multithreaded Processors

A) Introduction B) Processor Architecture C) Instruction Scheduling Strategy D) Static Code Scheduling E) Estimation F) Conclusion G) References

Page 2: Multithreaded Processors

A) Introduction

Why do we present a multithreaded processor architecture?

The generation of high-quality images requires great processing power. Furthermore, to model the real world as faithfully as possible, intensive numerical computations are also needed. This architecture could run such a graphics system.

To give an example:

Simulation results show that by executing 2 and 4 threads in parallel on a nine-functional-unit processor, a 2.02 and a 3.72 times speed-up, respectively, can be achieved over a conventional RISC processor.

Page 3: Multithreaded Processors

Introduction

Why do we use multiple threads?

It improves the utilization of the functional units.

Multiple Threads: Instructions from different threads are issued simultaneously to multiple functional units, and these instructions can begin execution unless there are functional unit conflicts.

This is applicable to the efficient execution of a single loop.

Scheduling Technique: In order to control functional unit conflicts between loop iterations, a new static code scheduling technique has been developed.

Page 4: Multithreaded Processors

IntroductionIntroduction

Single-thread execution has some disadvantages:

Each thread executes a number of data accesses and conditional branches. In the case of a distributed shared memory system:

1. Low processor utilization can result from long latencies due to remote memory access.

2. Low utilization of functional units within a processor can result from inter-instruction dependencies and functional operation delays.

Page 5: Multithreaded Processors

Introduction

Concurrent Multithreading: Attempts to keep the processor active during long latencies due to remote memory access. When a thread encounters an absence of data, the processor rapidly switches between threads.

Parallel Multithreading: A latency-hiding technique at the instruction level within a processor. When an instruction from a thread cannot be issued because of either a control or data dependency within the thread, an independent instruction from another thread is executed.
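The parallel-multithreading issue rule above can be sketched in a few lines. This is our own toy illustration, not the paper's hardware: each cycle the issue logic scans the thread slots in priority order and issues the first instruction not blocked by a dependency.

```python
# Toy sketch of parallel multithreading issue logic (illustrative only).
# Each thread is (name, ready); ready=False models a control or data
# dependency stalling that thread's next instruction.

def issue_cycle(threads):
    """Return the name of the thread that issues this cycle, or None."""
    for name, ready in threads:
        if ready:
            return name  # first unblocked thread wins the issue slot
    return None

# T0 is stalled on a dependency, so an independent instruction
# from T1 is issued instead of leaving the functional unit idle.
print(issue_cycle([("T0", False), ("T1", True)]))  # T1
```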

Page 6: Multithreaded Processors

B) Processor Architecture

Page 7: Multithreaded Processors

Processor Architecture

The processor is provided with several instruction queue unit and decode unit pairs, called thread slots. Each thread slot, associated with a program counter, makes up a logical processor.

Instruction Queue Unit: Has a buffer which saves some instructions succeeding the instruction indicated by the program counter.

Decode Unit: Gets an instruction from an instruction queue unit and decodes it. Branch instructions are executed within the decode unit.

Page 8: Multithreaded Processors

Processor Architecture

Issued instructions are dynamically scheduled by instruction schedule units and delivered to functional units.

When an instruction is not selected by an instruction schedule unit, it is stored in a standby station and remains there until it is selected.

Large Register Files: Divided into banks, each of which is used as a full register set private to a thread. Each bank has two read ports and one write port. When a thread is executed, the bank allocated for the thread is logically bound to the logical processor.

Queue Registers: Special registers which enable communication between logical processors at the register-transfer level.

Page 9: Multithreaded Processors

Processor Architecture

Instruction Pipeline

IF: The instruction is read from a buffer of an instruction queue unit in one cycle.

D1: The format or type of the instruction is tested. In the case of a branch instruction, an instruction fetch request is sent to the instruction fetch unit at the end of stage D1.

D2: The instruction is checked to determine whether or not it is issuable.

Page 10: Multithreaded Processors

Processor Architecture

Instruction Pipeline

S: This stage is inserted for dynamic scheduling in the instruction schedule units. Required operands are read from registers in stage S.

EX: This stage is dependent on the kind of instruction.

W: The result value is written back to a register.
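Assuming one cycle per stage and no stalls (an idealization of the slides, not a claim from the paper), an instruction's path through the six stages can be tabulated:

```python
# Hypothetical walk-through of the six-stage pipeline named above.
# Stage names mirror the slides; the one-cycle-per-stage timing is
# an illustrative assumption.

STAGES = ["IF", "D1", "D2", "S", "EX", "W"]

def stage_at(cycle, issue_cycle=0):
    """Which stage an instruction fetched at `issue_cycle` occupies
    at `cycle`, or None if it is not in the pipeline."""
    idx = cycle - issue_cycle
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# An instruction fetched at cycle 0 reaches writeback at cycle 5.
print([stage_at(c) for c in range(6)])  # ['IF', 'D1', 'D2', 'S', 'EX', 'W']
```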

Page 11: Multithreaded Processors

C) Instruction Scheduling Strategy

A dynamic instruction scheduling policy is implemented in the instruction schedule unit, which works in one of two modes:

1) Implicit-rotation mode: Priority rotation occurs every given number of cycles (the rotation interval), as shown in Figure 4.

2) Explicit-rotation mode: Rotation of priority is controlled by software. The rotation is done when a change-priority instruction is executed on the logical processor with the highest priority.
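Implicit rotation amounts to a round-robin shift of logical-processor priority once per interval. A minimal sketch, assuming a simple cyclic rotation (the slides do not specify the exact hardware ordering):

```python
# Sketch of implicit-rotation priority: thread-slot priorities rotate
# cyclically once every `interval` cycles (assumed ordering).

def priority_order(n_slots, cycle, interval):
    """Return slot indices from highest to lowest priority at `cycle`."""
    shift = (cycle // interval) % n_slots
    return [(shift + i) % n_slots for i in range(n_slots)]

print(priority_order(4, 0, 16))   # [0, 1, 2, 3]
print(priority_order(4, 16, 16))  # [1, 2, 3, 0] after one rotation
```

In explicit-rotation mode, the same shift would instead be triggered by a change-priority instruction rather than by a cycle counter, which is what lets the compiler predict the priority ordering statically.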

Page 12: Multithreaded Processors

Instruction Scheduling Strategy

Why is explicit-rotation mode used in our architecture?

1) To aid the compiler in scheduling the code of threads executed in parallel, when possible.

2) To parallelize loops which are difficult to parallelize using other architectures.

Page 13: Multithreaded Processors

D) Static Code Scheduling

Main Goal: The compiler reorders the code without consideration of other threads, and concentrates on shortening the processing time of each thread.

A new algorithm which makes the most of the standby stations and instruction schedule units has been developed.

The algorithm employs a resource reservation table and a standby table.

Page 14: Multithreaded Processors

Static Code Scheduling

Resource Reservation Table

1. To avoid resource conflicts.

2. To tell the compiler when the instruction in the standby station is executed.

Standby Table

1. Stores the instructions which are not issued.

Explicit-rotation mode enables the compiler to know which instruction is selected.

Page 15: Multithreaded Processors

E) Estimation

Cache simulation has not been implemented in our simulator, so we assumed that all attempts to access the caches were hits.

We also assumed that there were no bank conflicts. Latencies of each instruction are listed in Table 1.

In order to estimate our architecture, we use the speed-up ratio as a criterion.

Page 16: Multithreaded Processors

Estimation

A 1.83 times speed-up is gained by parallel execution using 2 thread slots, even though not all of the hardware of a single-threaded processor is duplicated in the processor.

By using 4 thread slots we gain a less effective increase: 2.89/1.83 = 1.58.

When 8 thread slots are provided, the utilization of the busiest functional unit, the load/store unit, becomes 99%. This is the reason why the speed-up saturates at only 3.22 times.

Addition of another load/store unit improves the speed-up ratios by 10.4~79.8%.
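The diminishing returns quoted above are easy to verify from the reported speed-ups (1.83 at 2 slots, 2.89 at 4, 3.22 at 8):

```python
# Incremental gain from doubling the number of thread slots,
# using the speed-up figures reported in the slides.

speedup = {2: 1.83, 4: 2.89, 8: 3.22}

# Going from 2 to 4 slots multiplies throughput by only ~1.58,
# and from 4 to 8 slots by only ~1.11: the load/store unit saturates.
print(round(speedup[4] / speedup[2], 2))  # 1.58
print(round(speedup[8] / speedup[4], 2))  # 1.11
```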

Page 17: Multithreaded Processors

Estimation

Standby stations improve the speed-up ratio by 0~2.2%.

In the case of application programs whose threads are rich in fine-grained parallelism, greater improvement can be achieved.

Page 18: Multithreaded Processors

Estimation

The sample program is the Livermore Kernel 1, written in Fortran. Table 3 lists average execution cycles for one iteration.

Strategy A represents a simple list scheduling approach.

Strategy B represents list scheduling with a resource reservation table and a standby table.

Page 19: Multithreaded Processors

Estimation

Strategy B is overall superior to the other strategies. It achieves a performance improvement of 0~19.3%.

The object code contains three load instructions and one store instruction, so at least (3+1)*2 = 8 cycles are required for one iteration.

Page 20: Multithreaded Processors

F) Conclusion

1. A 2-threaded processor with 2 load/store units achieves a factor of 2.02 speed-up over a sequential machine, and a 4-threaded processor achieves a factor of 3.72.

2. A new static code scheduling algorithm has been developed, derived from the idea of software pipelining.

3. The poor variety of tested programs (e.g., cache effects are not exercised) is the weak point.

4. Ongoing work on evaluating finite cache effects and on the detailed design of the processor will help us confirm the effectiveness of the architecture.

Page 21: Multithreaded Processors

G) References

1. H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads," In International Symposium on Computer Architecture, pages 136-145, 1992.

2. A. Farcy and O. Temam, "Improving Single-Process Performance with Multithreaded Processors," Université de Versailles, Paris.

3. H. Hirata, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa, "A Multithreaded Processor Architecture with Simultaneous Instruction Issuing," In Proc. of Intl. Symp. on Supercomputing, Fukuoka, Japan, pp. 87-96, November 1991.