Customizable Domain-Specific Computing

Jason Cong
Center for Domain-Specific Computing
UCLA Computer Science Department
[email protected]
http://cadlab.cs.ucla.edu/~cong

The Power Barrier ...
(Source: Shekhar Borkar, Intel)
Focus: A New Transformative Approach to Power/Energy-Efficient Computing

Current Solution: Parallelization
(Source: Shekhar Borkar, Intel)
Cost and Energy Are Still a Big Issue ...

Cost of computing:
• HW acquisition
• Energy bill
• Heat removal
• Space
• ...
Next Significant Opportunity -- Customization

Parallelization -> Customization: adapt the architecture to the application domain
(Source: Shekhar Borkar, Intel)
Motivation

A few facts:
• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only selected tasks in its domain
• Application-specific integrated circuits (ASICs) can deliver 10,000X+ better power/performance efficiency, but are too expensive to design and manufacture

Our proposal: a general, customizable platform for the given domain(s) that
• can be customized to a wide range of applications in the domain
• can be mass-produced cost-efficiently
• can be programmed efficiently with novel compilation and runtime systems

Goal: a "supercomputer-in-a-box" with 100X performance/power improvement via customization for the intended domain(s)
Analogy: the advance of civilization via specialization/customization
Example Application Domain: Healthcare

Medical imaging has transformed healthcare:
• An in vivo method for understanding disease development and patient condition
• Estimated to be a $100 billion/year market
• More powerful and efficient computation can help:
  - Fewer exposures, using compressive sensing
  - Better clinical assessment (e.g., for cancer), using improved registration and segmentation algorithms

Hemodynamic simulation:
• Very useful for surgical procedures involving blood flow and vasculature

Both may take hours to days to construct; the clinical requirement is 1-2 minutes.
Cloud computing won't work (communication, real-time requirements, privacy) -- a megawatt datacenter for each hospital?
[Figures: intracranial aneurysm reconstruction with hemodynamics; magnetic resonance (MR) angiograph of an aneurysm]
Medical Image Processing Pipeline

[Pipeline: reconstruction -> denoising -> registration -> segmentation -> analysis, each stage annotated on the slide with its governing algorithm and equations:]

• Reconstruction (compressive sensing): medical images exhibit sparsity and can be sampled at a rate well below the classical Shannon-Nyquist rate; reconstruction solves a regularized least-squares problem of the form min_u ||A R u - (sampled points)||^2 + λ ||S ∇u||
• Denoising (total variational algorithm): each voxel is updated from a weighted combination of its neighbors, with weights of the form w_{ij} ∝ exp(-|z_i - z_j|^2 / (2σ^2)) normalized over the voxel volume
• Registration (fluid registration): the deformation velocity v satisfies the viscous-fluid PDE μΔv + (μ+η)∇(∇·v) = -[R(x) - T(x-u)]∇T(x-u), with the displacement evolving as ∂u/∂t + v·∇u = v
• Segmentation (level set methods): the surface is the zero level set {x : φ(x,t) = 0} of an evolving function, ∂φ/∂t = |∇φ| [ F + λ div(∇φ/|∇φ|) ]
• Analysis (hemodynamic simulation, Navier-Stokes equations): ∂v_i/∂t + Σ_j v_j ∂v_i/∂x_j = -∂p/∂x_i + υ Σ_j ∂²v_i/∂x_j² + f_i(x,t)
Application Domains: Medical Image Processing Pipeline

[Pipeline: reconstruction -> denoising -> registration -> segmentation -> analysis, using compressive sensing, the total variational algorithm, fluid registration, level set methods, and the Navier-Stokes equations]

Across the stages, the computation and communication patterns include:
• Non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grids, optimization methods
• Parallel, global communication; dense linear algebra, optimization methods
• Local communication; sparse linear algebra, n-body methods, graphical models
• Local communication; dense linear algebra, spectral methods, MapReduce
• Iterative, local or global communication; dense and sparse linear algebra, optimization methods

• These algorithms have diverse computation & communication patterns
• A single homogeneous system cannot perform very well on all of these algorithms
Need for Customization in the Medical Image Processing Pipeline

[Pipeline: reconstruction -> denoising -> registration -> segmentation -> analysis, with the same per-stage algorithms and patterns as on the previous slide]

• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• The pipeline includes many common computation kernels ("motifs"), so the approach is applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms):

  Platform             Speedup   Power
  CPU (Xeon 2.0 GHz)   1x        ~100 W
  GPU (Tesla C1060)    93x       ~150 W
  FPGA (xc4vlx100)     11x       ~5 W

3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels):

  Platform             Algorithm                    Speedup   Power
  CPU (Xeon 2.0 GHz)   Quickselect                  1x        ~100 W
  GPU (Tesla C1060)    Median of medians            70x       ~140 W
  FPGA (xc4vlx100)     Bit-by-bit majority voting   1200x     ~3 W
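The FPGA's 1200x result comes from a bit-serial majority-vote median, which maps naturally to hardware. Below is a minimal C model of that general scheme -- my reconstruction of the standard bit-serial median algorithm, not CDSC's actual hardware design: bit planes are processed from MSB to LSB, the median's bit at each position is the majority vote, and any value that disagrees with a decided bit is thereafter known to lie strictly below or above the median and votes accordingly.

```c
/* Bit-serial (majority-vote) median of n b-bit values, n odd, n <= 27.
 * Each pass over a bit plane needs only counting and comparison, which is
 * why this maps well to FPGA logic. */
unsigned bitwise_median(const unsigned *v, int n, int bits) {
    int state[27] = {0};  /* 0: still matches median prefix; -1: below; +1: above */
    unsigned med = 0;
    for (int k = bits - 1; k >= 0; --k) {
        int ones = 0;
        for (int i = 0; i < n; ++i) {
            /* values already classified vote all-0 (below) or all-1 (above) */
            int b = (state[i] == 0) ? (int)((v[i] >> k) & 1) : (state[i] > 0);
            ones += b;
        }
        int mbit = (2 * ones > n);          /* majority vote for this bit plane */
        med |= (unsigned)mbit << k;
        for (int i = 0; i < n; ++i)
            if (state[i] == 0 && (int)((v[i] >> k) & 1) != mbit)
                state[i] = mbit ? -1 : +1;  /* disagreed: below if majority was 1 */
    }
    return med;
}
```

For the 3 x 3 x 3 filter, n = 27 and bits = 8 for byte voxels; each output voxel costs 8 counting passes over 27 values, all of which can run in parallel per bit plane.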
Center for Domain-Specific Computing (CDSC) Organization

Faculty include: Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA), Potkonjak (UCLA), ...
• A diversified and highly accomplished faculty team: 8 in CS&E, 1 in EE, 2 in the medical school, 1 in applied math
• 15-20 postdocs and graduate students at four universities: UCLA, Rice, Ohio State, and UC Santa Barbara
Architectural and Frequency Adaptations for Multimedia Applications (cont'd)

Important conclusions:
• DVS gives most of the energy reduction
• Architectural adaptation further reduces energy when layered on top of DVS
• Without DVS, less aggressive architectures are more energy-efficient
• With DVS, more aggressive architectures are often more energy-efficient: their higher IPC means they can run at a lower frequency and voltage for the same performance

Key findings:
• Up to 5.3x improvement in efficiency through adaptation
• Relatively frequent adaptation (80K-instruction intervals) is needed to achieve maximum efficiency
• On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain; however, which 3 parameters matter depends on the application and phase
• DVFS provides relatively smaller efficiency benefits once architecture adaptations are applied
Existing Studies on Core Customization (not domain-specific)

• Core spilling -- spill from 1 core up to 8 cores [Cong et al., IEEE Trans. on Parallel and Distributed Systems, 2007]: less than 50% worse than an ideal 8x-powerful core; up to 40% improvement for changing workloads
• Core fusion -- 2-issue cores fused to emulate 4- and 6-issue cores [Ipek et al., ISCA 2007]: less than 30% and 20% worse for sequential and parallel benchmarks, respectively
• Adaptive issue logic and issue queue (43/58 entries) [Folegnani & Gonzalez, ISCA 2001]: 16% total processor energy saving
• Downsized instruction queue (17/32), reorder buffer (57/128), and load/store queue (18/32) [Ponomarev et al., MICRO 2001]: power saving of 59% for the three components
• Adaptive issue width (8/4/2), issue queue (128/64/32), and function units (4/2), combined with dynamic voltage scaling [Hughes et al., MICRO 2001]: up to 78% total energy saving
• Up to 50% power reduction and 55% performance improvement
Some Recent Studies -- Automatic Memory Partitioning (to appear in ICCAD 2009)

The memory system is critical for high-performance, low-power design:
• The memory bottleneck limits the maximum parallelism
• The memory system accounts for a significant portion of total power consumption

Goal: given platform information (memory ports, power, etc.), a behavioral specification, and throughput constraints:
• Partition memories automatically
• Meet the throughput constraints
• Minimize power consumption
(a) C code:

  for (int i = 0; i < n; i++)
      ... = A[i] + A[i+1];

(b) Scheduling: the two reads A[i] and A[i+1] are issued in the same cycle
(c) Memory architecture after partitioning: A[0, 2, 4, ...] in one bank (read port R1), A[1, 3, 5, ...] in another (read port R2), selected by a decoder
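The transformation in this figure can be sketched in plain C -- an illustrative model of the idea, not AMP's actual output. With A split cyclically into even and odd banks, A[i] and A[i+1] always land in different banks, so both operands of each iteration can be served by one read per bank in a single cycle.

```c
/* Even/odd cyclic partitioning of an array across two single-port banks.
 * Element A[i] lives in bank (i % 2) at offset (i / 2). */
#define N 8
static int bank0[(N + 1) / 2];  /* A[0], A[2], A[4], ... */
static int bank1[N / 2];        /* A[1], A[3], A[5], ... */

static int read_elem(int i) {   /* bank select by low bit, offset by shift */
    return (i & 1) ? bank1[i >> 1] : bank0[i >> 1];
}

/* Computes out[i] = A[i] + A[i+1] for i = 0..N-2; returns the count.
 * For every i, the two read_elem calls hit different banks. */
int partitioned_sum_pairs(const int *a, int *out) {
    for (int i = 0; i < N; ++i)             /* fill the banks cyclically */
        (i & 1 ? bank1 : bank0)[i >> 1] = a[i];
    for (int i = 0; i < N - 1; ++i)
        out[i] = read_elem(i) + read_elem(i + 1);
    return N - 1;
}
```

In hardware, the low bit of the index drives the decoder of the figure, and the two banks' read ports supply both operands per cycle without doubling the memory's port count.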
Automatic Memory Partitioning (AMP): Techniques

• Capture array access conflicts in a conflict graph for throughput optimization
• Model the loop kernel with parametric polytopes to obtain array access frequencies

Contributions:
• Automatic approach for design-space exploration
• Cycle-accurate; handles irregular array accesses
• Lightweight profiling for power optimization
Behavioral & Communication Synthesis and Optimizations: AutoPilot (TM)

[ESL synthesis flow: design specification + user constraints + common testbench -> AutoPilot]

Platform-based C-to-FPGA synthesis:
• Synthesizes pure ANSI C and C++; GCC-compatible compilation flow
• Full support for IEEE-754 floating-point data types and operations
• Efficiently handles bit-accurate fixed-point arithmetic
• More than 10X design productivity gain
• High quality of results
Some Other Uses of AutoPilot (Microsoft)
On John Cooley's DeepChip, 6/30/09: http://www.deepchip.com/items/0482-06.html

"We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements...
1. RankBoost -- a machine-learning algorithm used in the dynamic ranking of search engines...
2. Sorting Algorithm -- also several thousand lines of OO C++ code with 138 lines that needed speeding up..."
CHP Creation -- Design Space Exploration

Key questions: What is the optimal trade-off between efficiency and customizability? Which options are fixed at CHP creation, and which are set by the CHP mapper?

Custom instructions & accelerators:
• Amount of programmable fabric
• Shared vs. private accelerators
• Custom instruction selection
• Choice of accelerators
• ...

Core parameters:
• Frequency & voltage
• Datapath bit width
• Instruction window size
• Issue width
• Cache size & configuration
• Register file organization
• Number of thread contexts
• ...

NoC parameters:
• Interconnect topology
• Number of virtual channels
• Routing policy
• Link bandwidth
• Router pipeline depth
• Number of RF-I enabled routers
• RF-I channel and bandwidth allocation
• ...
Current On-Chip Interconnect Technology

Optimized RC lines with repeaters:
• Wire sizing, buffer insertion, buffer sizing, ...
• E.g., the UCLA TRIO and IPEM packages

Reconfigurable interconnects:
• For FPGAs: RC busses with pass-transistors or bidirectional buffers
• For CMPs (chip multiprocessors): mesh-like networks-on-chip (NoC)
• Both pay a large penalty in performance
Used vs. Available Bandwidth in Modern CMOS

At the 45nm CMOS technology node:
• Data rate: 4 Gbit/s
• The fT of 45nm CMOS can be as high as 240 GHz
• Baseband signaling uses a bandwidth of only about 4 GHz
• So 98.4% of the available bandwidth is wasted

Question: how can we take advantage of the full bandwidth of modern CMOS?
323.5 GHz VCO: On-Wafer VCO Test Setup at JPL

A CMOS voltage-controlled oscillator, designed by Frank Chang's group at UCLA and fabricated in a 90nm process, was measured with a subharmonic mixer driven by an 80 GHz synthesizer local oscillator. The mixing relation is fVCO - 4*fLO = fIF, i.e., fVCO - 4*(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz.

*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb. 2008, San Francisco, CA.
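The measurement arithmetic above is a one-liner; a small check (the helper name is mine, for illustration):

```c
/* Subharmonic mixing: the mixer output obeys fVCO - harmonic*fLO = fIF,
 * so the oscillator frequency is recovered as fVCO = harmonic*fLO + fIF. */
double vco_freq_ghz(double f_lo_ghz, double f_if_ghz, int harmonic) {
    return harmonic * f_lo_ghz + f_if_ghz;
}
```

With fLO = 80 GHz, fIF = 3.5 GHz, and the 4th harmonic, this gives 4*80 + 3.5 = 323.5 GHz, matching the slide.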
Multiband RF-Interconnect

• In the TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
• N different data streams (N = 6 in the figure) can be transmitted simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In the RX, the individual signals are down-converted by mixers and recovered after low-pass filtering

[Figure: signal spectrum and per-channel signal power along the shared transmission medium]
Tri-band On-Chip RF-I Test Results

  Process                     IBM 90nm CMOS digital process
  Channels                    3 total: baseband, 30 GHz, 50 GHz
  Data rate per channel       RF bands: 4 Gbps; baseband: 2 Gbps
  Total data rate             10 Gbps
  Bit error rate (all bands)  < 10^-9
  Latency                     6 ps/mm
  Energy per bit (RF)         0.09* pJ/bit/mm
  Energy per bit (baseband)   0.125 pJ/bit/mm

[Figures: data output waveform; output spectrum of the RF bands at 30 GHz and 50 GHz]

*The VCO power (5 mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not significantly burden an individual link.
Comparison Between a Repeated Bus and Multi-band RF-I at 32nm

Assumptions:
1. 32nm node; 30x repeaters, FO4 = 8 ps, Rwire = 306 Ω/mm, Cwire = 315 fF/mm, wire pitch = 0.2 um, bus length = 2 cm, f_bus = 1 GHz, bus width = 96 bytes
2. Repeater area = 0.022 mm2
3. Bus physical width = 160 um
4. In that width we can fit 13 transmission lines, each carrying 7 carriers at 8 Gbps apiece

Interconnect length = 2 cm:

                               RF-I      Repeated bus
  Number of wires              13        448
  Data rate per carrier        8 Gb/s    N/A
  Number of carriers           7         N/A
  Data rate per wire           56 Gb/s   1 Gb/s
  Aggregate data rate          728 Gb/s  768 Gb/s
  Bus physical width (um)      160       160
  Transceiver area (mm2)       0.27      0.022
  Power (mW)                   455       6144
  Energy per bit (pJ/bit)      0.63      8
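The derived rows of the table follow directly from the assumptions; a small C check of the arithmetic (helper names are mine), with units chosen so that mW divided by Gb/s gives pJ/bit:

```c
/* mW / (Gb/s) = 1e-3 (J/s) / 1e9 (bit/s) = 1e-12 J/bit = pJ/bit. */
double energy_per_bit_pj(double power_mw, double rate_gbps) {
    return power_mw / rate_gbps;
}

/* RF-I aggregate rate: lines * carriers per line * rate per carrier. */
double rfi_aggregate_gbps(int lines, int carriers, double gbps_per_carrier) {
    return lines * carriers * gbps_per_carrier;  /* 13 * 7 * 8 = 728 Gb/s */
}
```

Plugging in the table's numbers: RF-I delivers 728 Gb/s at 455 mW, or about 0.63 pJ/bit; the repeated bus delivers 96 bytes * 8 bits * 1 GHz = 768 Gb/s at 6144 mW, or 8 pJ/bit -- roughly a 13x energy-per-bit advantage for RF-I at comparable aggregate bandwidth.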
Architectural Impact of Using RF-I

High-bandwidth communication:
• Data distribution across many-core topologies
• Vital for keeping many-core designs active

Low-latency communication:
• Enables users to apply parallel computing to a broader range of applications through faster synchronization and communication
• Faster cache-coherence protocols

Reconfigurability:
• Adapt NoC topology/bandwidth to the needs of the individual application
CHP Mapping -- Compilation and Runtime Software Systems for Customization

Goals: efficient mapping of a domain-specific specification to customizable hardware; adapt the CHP to a given application for drastic performance/power efficiency improvement
FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009)

Use CUDA in tandem with high-level synthesis (HLS) to:
• Enable high-level abstraction for FPGA programming
• Exploit the massively parallel compute capabilities of FPGAs
• Provide a single interface for GPU and FPGA kernel acceleration

CUDA: a C-based parallel programming model for GPUs
• Concise expression of coarse-grained parallelism
• Very popular (wide range of existing applications)
• Explicit partitioning and transfer of data between off-chip and on-chip memory

AutoPilot: an advanced HLS tool (from AutoESL)
• Platform-specific (i.e., FPGA/ASIC) C-to-RTL mapping
• Fine-grained and loop-iteration parallelism extraction
• Annotated coarse-grained parallelism extraction (requires explicit expression and annotation from the programmer)
CUDA-to-AutoPilot C Translation

• Identify off-chip data transfers; aggregate multi-thread off-chip accesses into DMA bursts
• Split the kernel into computation and data-communication tasks
• Use thread-block granularity for splitting kernel threads into parallel FPGA cores
• Allocate data storage based on the following memory-space mapping:

  GPU                 FPGA
  Global              Off-chip DRAM
  Shared              On-chip BRAMs
  Constant/Texture    Registers
  Registers           Registers / local memory

[Figure: thread-block kernel tasks]
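The translation steps above can be sketched in plain C. This is a hypothetical illustration of the FCUDA-style pattern, not the tool's actual output; all names are mine. A CUDA thread kernel such as `c[i] = a[i] + b[i]` (with `i` derived from block and thread indices) becomes a per-thread-block C task: a DMA burst into on-chip buffers, a loop over the block's threads, and a burst back out.

```c
#include <string.h>
#define BLOCK 64  /* threads per block; each block maps to one FPGA core */

/* One thread-block task: burst in, compute over all threads, burst out. */
static void vadd_block(const float *a_dram, const float *b_dram,
                       float *c_dram, int block_idx) {
    float a_bram[BLOCK], b_bram[BLOCK], c_bram[BLOCK];  /* on-chip BRAMs */
    int base = block_idx * BLOCK;
    memcpy(a_bram, a_dram + base, sizeof a_bram);  /* aggregated DMA burst in */
    memcpy(b_bram, b_dram + base, sizeof b_bram);
    for (int t = 0; t < BLOCK; ++t)  /* thread loop; unrollable by the HLS tool */
        c_bram[t] = a_bram[t] + b_bram[t];
    memcpy(c_dram + base, c_bram, sizeof c_bram);  /* DMA burst out */
}

/* Top level: thread blocks become independent tasks for parallel cores. */
void vadd_fpga(const float *a, const float *b, float *c, int n) {
    for (int blk = 0; blk < n / BLOCK; ++blk)
        vadd_block(a, b, c, blk);
}
```

Separating the memcpy bursts from the compute loop is the key point: it exposes the computation/communication split that lets HLS overlap DMA with computation and replicate the compute task across cores.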
Results

  Benchmark       Core #   DRAM bandwidth   Limiting resource
  matmul 32bit    128      3.5 GB/s         DSP
  matmul 16bit    176      1.6 GB/s         BRAM
  matmul 8bit     176      0.8 GB/s         BRAM
  cp 32bit        25       0.128 GB/s       DSP
  cp 16bit        96       0.19 GB/s        DSP
  cp 8bit         96       0.1 GB/s         DSP
  rc5-72 32bit    80       ≈ 0 GB/s         LUT

  Kernel                     Configuration             Description
  Matrix multiply (matmul)   1024x1024                 Common kernel in many imaging, simulation, and scientific applications
  Coulombic potential (cp)   4000 atoms, 512x512 grid  Computation of the electric potential in a volume containing charged atoms
  RSA encryption (rc5-72)    4 billion keys            Brute-force encryption key generation and matching

[Chart: FPGA speedup relative to GPU for matmul, cp, and rc5-72 at 32/16/8-bit configurations]

  Benchmark       GPU (GeForce 8800)   FPGA (Virtex5 xc5vfx200t)   FPGA-over-GPU benefit
  matmul 32bit    ≈ 100 W              10.622 W                    9.41X
  matmul 16bit    ≈ 100 W              10.559 W                    9.47X
  matmul 8bit     ≈ 100 W              9.954 W                     10.05X

• Speedup comparable to the GPU in several configurations; much more power-efficient than the GPU
• Assumes the FPGA has a high-bandwidth bus to off-chip DDR
Concluding Remarks

We believe that domain-specific customization is the next transformative approach to energy-efficient computing -- beyond parallelization.

Many research opportunities and challenges:
• Domain-specific modeling/specification
• Novel architecture & microarchitecture for customization
• Compilation and runtime software to support intelligent customization
• New research in testing, verification, and reliability for customizable computing

CDSC is taking a highly integrated approach: coordinated cross-layer customization in modeling, hardware, software, and application development.
Acknowledgements

Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA), Potkonjak (UCLA), ...
• A highly collaborative effort -- thanks to all my co-PIs at four universities: UCLA, Rice, Ohio State, and UC Santa Barbara
• Thanks for the support from the National Science Foundation