A Multi-core Parallelizing Compiler for Low-Power High-Performance Computing
Hironori Kasahara
Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp
(Seminar slides, October 2007)
Hironori Kasahara
<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Professor (1986, Waseda), Associate Professor (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Professor (1997-, Dept. of CS, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).
<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair), Trans. of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) (Program Committee; '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC.
Other: PCs of many conferences on supercomputing and parallel processing.
<Activities for Governments>
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Real-time Consumer Electronics (Project Leader), etc.
MEXT: Earth Simulator project evaluation committee, 10 PFLOPS supercomputer evaluation committee.
JAERI: Research accomplishment evaluation committee, CCSE 1st-class invited researcher.
JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.
<Papers>
151 papers with review, 20 symposium papers with review, 105 technical reports, 154 papers for annual conventions, 49 invited talks, 74 articles in newspapers and on the web, etc.
Multi-core Everywhere
Multi-core from embedded systems to supercomputers.
Consumer electronics (embedded): mobile phones, games, digital TV, car navigation, DVD, cameras.
Examples: IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (RP1).
NEDO (Feb. 2007 - Mar. 2010): Heterogeneous Multicore for Consumer Electronics — Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.
Plan: heterogeneous multicore architecture & compiler R&D.
METI/NEDO National Project: Multi-core for Real-time Consumer Electronics
<Goal> R&D of compiler-cooperative multi-core processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.
<Period> From July 2005 to March 2008
<Features>
・Good cost performance
・Short hardware and software development periods
・Low power consumption
・High reliability
・Applications shared among chips
・Scalable performance improvement with the advancement of semiconductor integration
[Figure: the new multi-core processor architecture (high performance, low power). Multi-core chips CMP 0 - CMP m, each containing processor cores PC 0 - PC n; each core has a CPU, DSM (distributed shared memory), LDM/D-cache (local data memory / L1 data cache), LPM/I-cache (local program memory / instruction cache), DTC (data transfer controller), and NI (network interface). Each chip also has a CSM / L2 cache (centralized shared memory or L2 cache) and an IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.). The chips, centralized shared memories CSM j, and I/O multi-core chips CSP k with I/O devices are connected by an InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.). A multicore-integrated ECU is also shown.]
OSCAR Parallelizing Compiler
Improves effective performance, cost-performance, and productivity, and reduces power consumption.
– Multigrain parallelization: exploitation of parallelism from the whole program by using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
– Data localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems.
– Data transfer overlapping: data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching.
– Power reduction: reduction of consumed power by compiler control of frequency, voltage, and power shutdown with hardware support.
Generation of Coarse-Grain Tasks
Macro-tasks (MTs):
– Block of Pseudo Assignments (BPA): basic block (BB)
– Repetition Block (RB): outermost natural loop
– Subroutine Block (SB): subroutine
[Figure: hierarchical decomposition of a program over the total system. 1st layer: the program is decomposed into BPA/RB/SB macro-tasks for coarse-grain parallelization. 2nd and 3rd layers: BPAs are parallelized at near-fine grain, RBs at loop level or at near-fine grain of the loop body, and SBs again at coarse grain.]
Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-tasks)
[Figure: A Macro Flow Graph of macro-tasks 1-14 (BPA: block of pseudo assignment statements; RB: repetition block) with edges for data dependency and control flow and small circles for conditional branches, and the Macro Task Graph derived from it, whose edges represent data dependency and extended control dependency combined by AND/OR conditions over conditional branch outcomes (dashed edges: original control flow).]
Automatic Processor Assignment in 103.su2cor
• Using 14 processors
  – Coarse-grain parallelization within DO400 of subroutine LOOPS
[Figure: MTG of su2cor LOOPS DO400; coarse-grain parallelism PARA_ALD = 4.3; node types: DOALL, sequential loop, BB, SB.]
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and Doseq) into CARs and LRs considering inter-loop data dependence.
  – Most data in an LR can be passed through LM.
  – LR: Localizable Region, CAR: Commonly Accessed Region

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Decomposition (LR | CAR | LR | CAR | LR):
  RB1: DO I=1,33 | DO I=34,35 | DO I=36,66 | DO I=67,68 | DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 | DO I=35,67 | DO I=68,100
Inter-loop Data Dependence Analysis in TLG
• Define the exit RB in the TLG as the Standard-Loop.
• Find the iterations on which an iteration of the Standard-Loop is data-dependent.
  – e.g., the Kth iteration of RB3 is data-dependent on the (K-1)th and Kth iterations of RB2, and on the (K-1)th, Kth, and (K+1)th iterations of RB1.
[Example of TLG: RB1 (Doall) DO I=1,101: A(I)=2*I — RB2 (Doseq) DO I=1,100: B(I)=B(I-1)+A(I)+A(I+1) — RB3 (Doall) DO I=2,100: C(I)=B(I)+B(I-1), with iteration spaces I(RB1), I(RB2), I(RB3).]
Decomposition of RBs in TLG
• Decompose GCIR into DGCIRp (1≦p≦n)
  – n: (a multiple of) the number of PCs; DGCIR: decomposed GCIR
• Generate a CAR on which both DGCIRp and DGCIRp+1 are data-dependent.
• Generate an LR on which only DGCIRp is data-dependent.
[Figure: iteration spaces I(RB1) (1-101), I(RB2) (1-100), I(RB3) (2-100) of GCIR partitioned into DGCIR1-DGCIR3: RB1 pieces 1-33, 36-66, 69-101 with CARs RB1<1,2> (34-35) and RB1<2,3> (67-68); RB2 pieces 1-33, 35-66, 68-100 with CARs RB2<1,2> (34) and RB2<2,3> (67); RB3 pieces 2-34, 35-67, 68-100.]
Data Localization
[Figure: an MTG, the MTG after loop division, and a schedule for two processors (PE0, PE1). Divided macro-tasks belonging to the same data-localization group (dlg0-dlg3) are assigned consecutively to the same processor, so data shared within a group can be passed through local memory or cache.]
An Example of Data Localization for Spec95 Swim
(a) An example of a target loop group for data localization:
      DO 200 J=1,N
        DO 200 I=1,M
          ...
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE
(b) Image of the alignment of arrays on cache accessed by the target loops: cache line conflicts occur among arrays which share the same location on cache.

Data Layout for Removing Line Conflict Misses by Array Dimension Padding
– Data localization
– Data transfer overlapping
– Compiler-controlled power saving scheme
• Compiler-cooperative multi-core architecture
  – OSCAR multi-core architecture
  – OSCAR heterogeneous multiprocessor architecture
  – Commercial SMP machines
Performance of OSCAR Compiler Using Memory Management API on SGI Altix 450
Montecito CC-NUMA server, using SPEC95 tomcatv.
[Chart: speedup over sequential execution with the Intel compiler ver. 9.1 for 1, 2, 4, 8, and 16 processors; the proposed scheme (DSM: 4 MB and DSM: 2 MB) reaches a 6.25x speedup over the Intel compiler.]
Processor Block Diagram
CCN: cache controller; IL, DL: instruction/data local memory; URAM: user RAM; GCPG: global clock pulse generator; LCPG: local CPG for each core; LBSC: SRAM controller; DBSC: DDR2 controller.
[Figure: four 600 MHz cores (Core #0-#3), each with CPU, FPU, 32 KB I-cache, 32 KB D-cache, 8 KB IL, 16 KB DL, 128 KB URAM, CCN, and a per-core LCPG (LCPG0-3); a snoop controller (SNC); a 300 MHz on-chip system bus (SHwy); DBSC (DDR2, 32-bit), LBSC (SRAM, 32-bit), PCI Express (4 lanes), HW IP, and GCPG.]
Chip Overview: SH4A Multicore SoC
– Process technology: 90 nm, 8-layer, triple-Vth CMOS
– Chip size: 97.6 mm² (9.88 mm x 9.88 mm)
– Supply voltage: 1.0 V (internal), 1.8/3.3 V (I/O)
– Power consumption: 0.6 mW/MHz/CPU @ 600 MHz (90 nm G)
– Clock frequency: 600 MHz
– CPU performance: 4320 MIPS (Dhrystone 2.1)
– FPU performance: 16.8 GFLOPS
– I/D cache: 32 KB 4-way set-associative (each)
– ILRAM/OLRAM: 8 KB / 16 KB (each CPU)
– URAM: 128 KB (each CPU)
– Package: FCBGA 554-pin, 29 mm x 29 mm
ISSCC07 Paper No. 5.3, Y. Yoshida, et al., "A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption"
Performance on a Developed SH Multi-core (RP1: SH-X3) Using Compiler and API
[Chart: speedup vs. number of processors (1-4). Audio AAC* encoder: 1.00, 1.91, 2.95, 3.82. Image Susan smoothing**: 1.00, 1.86, 2.70, 3.43.]
*) ISO Advanced Audio Coding. **) MiBench embedded application benchmark by Michigan Univ.
OSCAR Heterogeneous Multicore
• OSCAR-type memory architecture
  – LPM: local program memory
  – LDM: local data memory
  – DSM: distributed shared memory
  – CSM: centralized shared memory (on-chip and/or off-chip)
• DTU: data transfer unit
• Interconnection network
  – Multiple buses
  – Split-transaction buses
  – Crossbar, ...
Static Scheduling of Coarse-Grain Tasks for a Heterogeneous Multi-core
[Figure: a macro-task graph of MT1-MT13, each marked "for CPU" or "for DRP", and its static schedule onto CPU0, CPU1, and the DRP: CPU tasks (MT1, MT2, MT3, MT6, MT8, MT11) are placed on the two CPUs while DRP tasks (MT4, MT5, MT7, MT9, MT10, MT12, MT13) are placed on the DRP accelerator, ending at EMT.]
An Image of a Static Schedule for a Heterogeneous Multi-core with Data Transfer Overlapping and Power Control
[Figure: a Gantt chart along the time axis showing task execution overlapped with data transfers and with power control of the cores.]
Compiler Performance on an OSCAR Hetero-multi-core
25.2 times speedup using 4 SH general-purpose cores and 4 DRP accelerators against a single SH core (comparable performance to a 3 GHz high-performance processor from a 300 MHz low-power multicore); 41.3 times speedup in the best configuration.
[Chart: execution clock cycles and speedup against 1 SH core for the configurations 1CPU, 2CPU, 4CPU, 2CPU+1DRP, 2CPU+2DRP, 4CPU+2DRP, 2CPU+4DRP, 4CPU+4DRP, 8CPU+4DRP, 4CPU+8DRP, 8CPU+8DRP, on three memory/bus configurations of SH4A cores with on-chip CSM (off-chip STB×1, on-chip BUS×3, on-chip STB×3); representative speedups: 1.00, 2.90, 5.17-5.19, 6.09-6.11, 9.08-9.10, 10.05-10.34, 16.13-17.22.]
Power Reduction by OSCAR Compiler (4SHs+4DRPs)
0.78 W: 22% power reduction by compiler control.
[Chart: average power consumption [W], broken down into DRP, clock, FPU, IEU, BPU, and register, with frequency/voltage control off (FVOFF) vs. on (FVON) for the STB×1, Bus×3, and STB×3 configurations, off-chip and on-chip: about 1.01 W without control vs. 0.78 W with compiler control (-22%).]
2007.7.30
Conclusions
Compiler-cooperative, low-power, high-effective-performance multi-core processors will become increasingly important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.
Parallelizing compilers are essential for realizing:
– Good cost performance
– Short hardware and software development periods
– Low power consumption
– High software productivity
– Scalable performance improvement with advancement in semiconductor integration technology
Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low-power control technologies.