High-Performance Arithmetic Challenges: From Architectures to Circuits
Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar
Microprocessor Research, Intel Labs, Intel Corporation, Hillsboro, OR, USA
Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE, University of California, Davis, CA, USA
16th IEEE International Computer Arithmetic Symposium, Santiago, June 18th 2003
Outline
• Motivation
• Design choices for high-performance circuits
• SOI vs. bulk devices: ALU design test-case
  – 64-bit ALUs in PD-SOI and bulk CMOS
• Energy-efficient high-performance AGU/ALUs
  – 4GHz sparse-tree AGU design
  – 6.5-10GHz integer ALU design
• Summary
High-performance trends
• Frequency doubles every generation
• Performance-critical units
  – ALUs & AGUs
  – Register files, L0 caches
  – Single-cycle latency & throughput
[Figure: clock frequency in MHz (log scale, 0.1 to 100000) vs. year (1970 to 2020), with points for the 8080, 8086, 386, Pentium® processor and Pentium® 4 processor, and a "15-30 GHz" annotation.]
64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design & Scaling Trends
[S. Mathew et al., ISSCC 2001]
[S. Mathew et al., JSSC, Nov 2001]
Design choices
• High-performance devices:
  – Partially depleted Silicon-on-Insulator
  – Pros & cons vs. bulk CMOS
  – Scaling trends
• High-performance circuit design:
  – Sparse-tree semi-dynamic AGU
  – Single-rail dynamic ALU
PD-SOI Devices
• Body of devices not tied to Vcc/Vss
• Body is isolated by buried oxide
• Floating body!
[Figure: cross-section of PD-SOI devices. Labels: P-substrate, buried oxide, STI, n+ and p+ regions, P-type body, N-type body.]
History Effect in PD-SOI
• Delay = function of switching history
  – Capacitive coupling from S/G/D
  – Impact ionization, diode conduction
  – Transient Vbs ≠ DC Vbs
• Complicates timing analysis
[Figure: PD-SOI transistor cross-section. Labels: gate, S, D, n+ regions, buried oxide, backgate, body potential, and the coupling capacitances Csb, Cdb, Cgb and Cbox.]
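One way to see why a floating body makes delay history-dependent is to plug different body potentials into a first-order device model. The sketch below is an illustration, not a model from the talk: it uses the textbook body-effect expression for threshold voltage and an alpha-power-law delay estimate, and every parameter value (`vt0`, `gamma`, `phi2f`, `vdd`, `alpha`) is an assumed placeholder, not data for the 0.18µm process.

```python
import math


def threshold_voltage(vbs, vt0=0.30, gamma=0.25, phi2f=0.80):
    """NMOS threshold voltage from the textbook body-effect expression.

    vbs is the body-to-source voltage; all default values are assumed,
    illustrative numbers (volts, sqrt-volts, volts).
    """
    return vt0 + gamma * (math.sqrt(phi2f - vbs) - math.sqrt(phi2f))


def relative_delay(vbs, vdd=1.5, alpha=1.3):
    """Gate delay in arbitrary units using an alpha-power-law estimate."""
    vt = threshold_voltage(vbs)
    return vdd / (vdd - vt) ** alpha


if __name__ == "__main__":
    # Two hypothetical switching histories leave the body at different potentials.
    for label, vbs in [("body at rest", 0.0), ("body charged by recent switching", 0.3)]:
        print(f"{label}: Vt = {threshold_voltage(vbs):.3f} V, "
              f"relative delay = {relative_delay(vbs):.3f}")
```

In this model a gate whose body has been charged up (higher Vbs) sees a lower threshold and a shorter delay than the same gate starting from rest, so a timing tool has to bound delay over the possible histories instead of using a single number.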
64-bit ALU architecture
• Ideal test-bed for evaluating process technologies
[Figure: ALU block diagram. Labels: external operands, two 9:1 muxes, 5:1 mux, 3:1 mux, 2:1 mux, mux control, shift control, sign control, single-rail adder core, Sum, 1200µm loopback bus, 0.5pF.]
High-performance Adders: Kogge-Stone
• Generate all carries: full-blown binary tree
• Energy-inefficient
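For reference, the carry computation of a Kogge-Stone adder can be written as a bit-level functional model. This is a minimal Python sketch of the standard parallel-prefix recurrence, not the circuit from the talk; the function name `kogge_stone_add` and its return tuple are assumptions made for illustration. Counting the carry-merge cells makes the cost of generating every carry in the tree visible.

```python
def kogge_stone_add(a, b, width=64, carry_in=0):
    """Functional model of a Kogge-Stone adder.

    Returns (sum, carry_out, merge_levels, merge_cells).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Bitwise propagate (XOR) and generate (AND) signals.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    half_sum = p[:]                 # kept for the final sum XOR
    g[0] |= p[0] & carry_in         # fold the carry-in into bit 0

    G, P = g[:], p[:]
    levels = cells = 0
    dist = 1
    while dist < width:
        new_g, new_p = G[:], P[:]
        for i in range(dist, width):
            # Carry-merge cell: combine this span with the span `dist` bits below.
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
            cells += 1
        G, P = new_g, new_p
        dist *= 2
        levels += 1

    # The carry into bit i is the group generate of bits 0..i-1.
    carries = [carry_in] + G[:-1]
    total = 0
    for i in range(width):
        total |= (half_sum[i] ^ carries[i]) << i
    return total, G[width - 1], levels, cells


if __name__ == "__main__":
    s, cout, levels, cells = kogge_stone_add(0xFFFF_FFFF_FFFF_FFFF, 1)
    print(hex(s), cout, levels, cells)
```

Every one of those merge cells switches and leaks, even though only a fraction of the carries sit on the critical path.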
Non-critical Sum Generator
• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• Sparse-tree (1 in 4) carry selects appropriate sum
[Figure: conditional sum generator for bits i to i+3. Labels: Pi, Pi+1, Gi+1, Pi+2, Gi+2, Pi+3, Gi+3, XOR gates, conditional sums Sumi,0 and Sumi,1, 2:1 muxes selected by Carry, outputs Sumi to Sumi+3.]
[Figure: Conditional Carry for Cin=0. Array of CM and XOR cells.]
• Critical path: 7 gate stages, same as Kogge-Stone
• Sparse-tree: single-rail dynamic
• Exploit non-criticality of sum generator
• Convert to static logic
• Semi-dynamic design
[Figure: adder organization for bits 31 to 0, one row per gate stage:
  – Propagate/Generate/Partial Sum (dynamic)
  – Carry merge 0 (static)
  – Carry merge 1 (dynamic)
  – Carry merge 2 (static)
  – Carry merge 3 (dynamic)
  – Carry merge 4 (static)
  – Carry merge 5 (CSG) / Sum]
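The sparse-tree idea on these slides (only one carry in four comes from the critical tree, while a ripple chain prepares conditional sums off the critical path) can be modeled functionally. The Python sketch below is an assumption-laden illustration, not the authors' circuit: the names `ripple_add` and `sparse_tree_add` are hypothetical, the carries are computed with a simple prefix over 4-bit groups, and gate-level details such as the static/dynamic alternation are not modeled.

```python
def ripple_add(x, y, carry_in, bits):
    """Bit-serial ripple-carry add of two `bits`-wide values; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for i in range(bits):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        total |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | ((xi ^ yi) & carry)
    return total, carry


def sparse_tree_add(a, b, width=32, group=4):
    """Functional model of a sparse-tree adder with one carry per `group` bits."""
    if width % group:
        raise ValueError("width must be a multiple of group")
    mask, gmask = (1 << width) - 1, (1 << group) - 1
    a, b = a & mask, b & mask
    n_groups = width // group

    sum0, sum1, G, P = [], [], [], []
    for k in range(n_groups):
        x = (a >> (k * group)) & gmask
        y = (b >> (k * group)) & gmask
        # Non-critical path: conditional sums for carry-in 0 and carry-in 1.
        s0, c0 = ripple_add(x, y, 0, group)
        s1, _ = ripple_add(x, y, 1, group)
        sum0.append(s0)
        sum1.append(s1)
        G.append(c0)                      # group generates a carry on its own
        P.append(int((x ^ y) == gmask))   # group passes an incoming carry through

    # Critical path: parallel-prefix merge over the group signals only.
    dist = 1
    while dist < n_groups:
        G, P = (
            [G[k] | (P[k] & G[k - dist]) if k >= dist else G[k] for k in range(n_groups)],
            [P[k] & P[k - dist] if k >= dist else P[k] for k in range(n_groups)],
        )
        dist *= 2
    carries = [0] + G[:-1]                # carry into each group

    # The sparse carry selects the appropriate conditional sum (2:1 mux).
    result = 0
    for k in range(n_groups):
        result |= (sum1[k] if carries[k] else sum0[k]) << (k * group)
    return result


if __name__ == "__main__":
    print(hex(sparse_tree_add(0x0FFFFFFF, 0x00000001)))
```

Because the conditional sums are ready before the sparse carries arrive, the ripple chains add no delay to the critical path in this model, which is the non-criticality the slide exploits.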
Han-Carlson ALU Organization
• Single-rail dynamic 9-stage low-Vt design
[Figure labels: 84µm loopback bus; Sum and Sum# outputs.]
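As a point of comparison with Kogge-Stone, the Han-Carlson organization runs the prefix tree on the odd bit positions only and fixes up the even positions in one extra merge at the end. The sketch below is a functional Python model of that textbook structure; `han_carlson_add` is a hypothetical name, the default 32-bit width is an assumption, and the model does not attempt to reproduce the 9-stage dynamic circuit described on the slide.

```python
def han_carlson_add(a, b, width=32):
    """Functional model of a Han-Carlson adder; returns (sum, carry_out, merge_levels)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    levels = 0

    # First merge: each odd bit absorbs its even neighbour.
    for i in range(1, width, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]
    levels += 1

    # Kogge-Stone style prefix tree on the odd bits only.
    dist = 2
    while dist < width:
        new_g, new_p = G[:], P[:]
        for i in range(1, width, 2):
            if i - dist >= 1:
                new_g[i] = G[i] | (P[i] & G[i - dist])
                new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
        levels += 1

    # Final merge: even bits take their carry from the odd bit just below.
    for i in range(2, width, 2):
        G[i] = g[i] | (p[i] & G[i - 1])
    levels += 1

    # Sum: per-bit propagate XOR carry-in, with no carry into bit 0.
    carries = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, G[width - 1], levels


if __name__ == "__main__":
    print(han_carlson_add(0xFFFFFFFF, 1))
```

The trade is one more merge level than Kogge-Stone in exchange for roughly half the merge cells and wires in the tree.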
Odd-bits CSG Sum Generation
• Final carry-merge CSG (dual-rail carry output) → pass-transistor sum XOR
[Figure: odd-bit CSG circuit with carry-merge and sum-generation sections. Signals: gi#, gi-1#, pi#, Psumi, Carryi, Carry#i, Sumi, Sum#i.]
Even-bits CSG Sum Generation
• Domino-compatible sum
• Dual-rail sum from single-ended g inputs
[Figure: even-bit CSG circuit with carry-merge and sum-generation sections. Signals: gi#, Psumi, Carryi, Carry#i, Sumi, Sum#i.]
Die Micro-photograph
• 130nm 6-metal dual-Vt CMOS
• Scheduler: 210μm x 210μm
• ALU: 84μm x 336μm
[Figure: die micro-photograph with the scheduler and ALU labeled.]
Delay and Power Measurements
• 6.5GHz at 1.1V, 25ºC
• Power: 120mW total, 15mW leakage
• Scalable to 10GHz at 1.7V, 25ºC
[Figure: two plots vs. supply voltage (0.8V to 1.5V) at 25ºC: power and leakage power (mW, axes 0 to 450), and Fmax (GHz, axis 0.0 to 9.0). A "Design target" point is marked.]
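The headline numbers can be turned into per-cycle figures with a few lines of arithmetic. The snippet below is a convenience calculation, not something shown in the talk; it assumes the 120mW total and 15mW leakage figures were measured at the 6.5GHz, 1.1V operating point, as the slide's grouping suggests, and the function name `per_cycle_figures` is hypothetical.

```python
def per_cycle_figures(freq_hz, total_power_w, leakage_power_w):
    """Derive per-cycle energy and power split from measured totals."""
    switching_power_w = total_power_w - leakage_power_w
    return {
        "energy_per_cycle_pj": total_power_w / freq_hz * 1e12,
        "switching_energy_per_cycle_pj": switching_power_w / freq_hz * 1e12,
        "leakage_fraction": leakage_power_w / total_power_w,
    }


if __name__ == "__main__":
    # Values from the slide: 6.5GHz, 120mW total, 15mW leakage.
    figures = per_cycle_figures(6.5e9, 120e-3, 15e-3)
    for name, value in figures.items():
        print(f"{name}: {value:.3f}")
```

Under that assumption the loop spends roughly 18.5pJ per cycle, of which leakage accounts for one eighth.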
Improvements Over Dual-rail Domino
Area: 50%
Performance (delay): 10%
Active leakage: 40%
Robustness: equal
• Leakage reduced by eliminating dual-rail logic
• Robustness not compromised
• CSG improves both area and performance
Summary
• 4GHz AGU in 1.2V, 130nm technology
  – Sparse-tree adder architecture described: 20% speedup and 56% energy reduction
  – Semi-dynamic design: energy scales with switching activity
  – Dual-Vt non-critical paths: low active leakage energy
• 6.5GHz ALU and scheduler loop at 1.1V, 25ºC