Architecture and Design Automation for Application-Specific Processors. Philip Brisk, Assistant Professor, Dept. of Computer Science and Engineering, University of California, Riverside. IEEE 9th International Conference on ASIC (ASICON), Xiamen, China, October 26, 2011.


Feb 23, 2016

Page 1: Architecture and Design Automation for Application-Specific Processors

Architecture and Design Automation for Application-Specific Processors

Philip Brisk
Assistant Professor

Dept. of Computer Science and Engineering
University of California, Riverside

IEEE 9th International Conference on ASIC (ASICON)
Xiamen, China
October 26, 2011

Page 2: Architecture and Design Automation for Application-Specific Processors

Acknowledgment

The vast majority of slides in this presentation are taken from the Ph.D. Thesis of my friend and collaborator, Dr. Theo Kluter (Ph.D., EPFL, 2010)

Page 3: Architecture and Design Automation for Application-Specific Processors

Five Stage RISC Pipeline

[Figure: Fetch (I$) -> Decode (RF) -> Execute -> Memory (D$) -> Write-back (RF)]

Page 4: Architecture and Design Automation for Application-Specific Processors

Application-Specific Custom Unit (ASCU) for Instruction Set Extensions (ISEs)

[Figure: Fetch (I$) -> Decode (RF) -> Execute -> Memory (D$) -> Write-back (RF), extended with an ASCU]

Page 5: Architecture and Design Automation for Application-Specific Processors

Automatic ISE Identification

[Figure: Fetch (I$) -> Decode (RF) -> Execute -> Memory (D$) -> Write-back (RF), extended with an ASCU]

[Figure labels: Applications; Compiler; HW Synthesis; Assembly code with ISEs]

Page 6: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
• Compilation and Synthesis
• Conclusion

Page 7: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
• Compilation and Synthesis
• Conclusion

Page 8: Architecture and Design Automation for Application-Specific Processors

Example: Luminance Conversion in JPEG Compression

• 19 cycles in software
• 17-bit values
• Fixed-point
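The transcript does not reproduce the conversion formula, so the following is only a minimal sketch of a fixed-point luminance computation. It assumes the standard JFIF luma weights (0.299, 0.587, 0.114) scaled by 2^16; the constant names and the 16-bit scale are illustrative choices, not values taken from the slides.

```python
# Assumption: standard JFIF luma weights scaled by 2**16 (not stated on the slide).
SCALE_BITS = 16
C_R, C_G, C_B = 19595, 38470, 7471   # round(0.299, 0.587, 0.114 times 65536); sum is 65536
ROUND = 1 << (SCALE_BITS - 1)        # add one half before shifting, to round to nearest


def luminance(r: int, g: int, b: int) -> int:
    """Fixed-point Y for 8-bit R, G, B: three multiplies, three adds, one shift."""
    return (C_R * r + C_G * g + C_B * b + ROUND) >> SCALE_BITS


if __name__ == "__main__":
    for pixel in [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]:
        print(pixel, "->", luminance(*pixel))
```

In software each multiply, add, shift, load, and store is a separate instruction, which is the kind of sequence a single custom operation can collapse.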

Page 9: Architecture and Design Automation for Application-Specific Processors

Custom Hardware Implementation

One single-ported memory
• 4 – 5 cycles (3 loads, 1 arithmetic, 1 store)
• Speedup: 3.8x – 4.8x

R, G, B, and Y Memories
• 1 cycle for everything
• Speedup: 19x
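The speedup figures follow from dividing the 19-cycle software baseline by the hardware cycle count. A small sketch of that arithmetic (the function name `speedup` is illustrative):

```python
SW_CYCLES = 19  # software baseline for the luminance conversion example


def speedup(hw_cycles: int, sw_cycles: int = SW_CYCLES) -> float:
    """Speedup of a hardware implementation over the software baseline."""
    return sw_cycles / hw_cycles


if __name__ == "__main__":
    cases = [
        ("one single-ported memory, 5 cycles", 5),
        ("one single-ported memory, 4 cycles", 4),
        ("separate R, G, B, and Y memories, 1 cycle", 1),
    ]
    for label, cycles in cases:
        print(f"{label}: {speedup(cycles):.2f}x")
```

The 4-cycle case gives 4.75x, which the slide rounds to 4.8x.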

Page 10: Architecture and Design Automation for Application-Specific Processors

Custom ISE Logic

RF has 2 read ports
RF has 1 write port

Architectural Limitations
• Load data from memory into RF
• RF I/O bandwidth

Performance
• 7 cycles (3 loads, 2 ASCU, 1 store)
• Speedup: 3.1x

Page 11: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
• Compilation and Synthesis
• Conclusion

Page 12: Architecture and Design Automation for Application-Specific Processors

I/O Bandwidth Constraint

• AES Algorithm
  – Single round
  – 4 stages
• Best ISE
  – 22 inputs
  – 22 outputs
  – [Verma, Brisk, and Ienne, CASES 2007 & TCAD 2010]
• RF I/O constraints
  – Noticeable slowdown
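To get a rough sense of why the register file becomes the bottleneck, the sketch below counts how many register-file access cycles a 22-input, 22-output ISE would need with the 2 read ports and 1 write port mentioned earlier. It assumes, as a simplification not stated on the slides, that operands are serialized through the ports at one full port group per cycle; `rf_transfer_cycles` is a hypothetical helper.

```python
from math import ceil


def rf_transfer_cycles(n_inputs: int, n_outputs: int,
                       read_ports: int = 2, write_ports: int = 1) -> tuple:
    """Cycles spent only on moving operands through the RF ports.

    Assumption: every cycle can use all read ports or all write ports,
    and nothing else overlaps with the transfers.
    """
    return ceil(n_inputs / read_ports), ceil(n_outputs / write_ports)


if __name__ == "__main__":
    reads, writes = rf_transfer_cycles(22, 22)
    print(f"read cycles: {reads}, write cycles: {writes}")
```

Under these assumptions the transfers alone take 11 read cycles and 22 write cycles, regardless of how fast the custom logic itself is.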

Page 13: Architecture and Design Automation for Application-Specific Processors

Pipeline Forwarding

[Jayaseelan et al., DAC 2006]

1 output

I/O Bandwidth limitations
• Input bandwidth depends on number of pipeline stages
• Does not increase output bandwidth

Complicates instruction scheduling

Page 14: Architecture and Design Automation for Application-Specific Processors

Register File Clustering

4 inputs

1 output

[Karuri et al., ICCAD 2007]

I/O Bandwidth limitations
• Input bandwidth depends on number of clusters
• Does not increase output bandwidth

Compiler must eliminate inter-cluster copies
• More clusters => more copies
• NP-Hard

Page 15: Architecture and Design Automation for Application-Specific Processors

Shadow Registers

[Cong et al., FPGA 2005]

1 output

I/O Bandwidth
• No limitation on input bandwidth
• Does not increase output bandwidth

Increases ISA bitwidth

Page 16: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
• Compilation and Synthesis
• Conclusion

Page 17: Architecture and Design Automation for Application-Specific Processors

Architecturally Visible Storage

• DMA transfers data between memory and AVS
• Coherence problem between AVS and D$

[Biswas et al., DATE 2006, TCAD 2007]

Page 18: Architecture and Design Automation for Application-Specific Processors

Example: IDCT (from JPEG)

Page 19: Architecture and Design Automation for Application-Specific Processors

The Coherence Problem

Page 20: Architecture and Design Automation for Application-Specific Processors

The Coherence Problem

Page 21: Architecture and Design Automation for Application-Specific Processors

The Coherence Problem

Page 22: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
  – Custom ISE Logic
  – I/O Bandwidth
  – Local memories and coherence
    • Coherent and Speculative DMA
    • Virtual Ways
    • Way Stealing
• Compilation and Synthesis
• Conclusion

Page 23: Architecture and Design Automation for Application-Specific Processors

Coherent DMA

Page 24: Architecture and Design Automation for Application-Specific Processors

Coherent DMA

Page 25: Architecture and Design Automation for Application-Specific Processors

Speculative DMA

Coherent DMA loads and evicts the array from AVS during each iteration

Speculative DMA leaves the array in AVS and waits to evict it until the array is overwritten in AVS memory by other data, or until the data is read or written through the D$.
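The difference in DMA traffic can be illustrated with a toy counting model. This is only a sketch under simplifying assumptions that are not from the slides: each load or eviction of the array counts as one transfer, an `"ise"` event is one loop iteration that uses the array in AVS, and a `"conflict"` event stands for either trigger named above (the array being overwritten in AVS, or the data being accessed through the D$).

```python
def count_transfers(events: list, speculative: bool) -> int:
    """Count DMA transfers (loads plus evictions) for a sequence of events."""
    resident = False   # is the array currently held in AVS memory?
    transfers = 0
    for event in events:
        if event == "ise":
            if not resident:
                transfers += 1      # DMA load into AVS
                resident = True
            if not speculative:
                transfers += 1      # coherent DMA evicts after every iteration
                resident = False
        elif event == "conflict" and resident:
            transfers += 1          # speculative DMA evicts only now
            resident = False
    return transfers


if __name__ == "__main__":
    trace = ["ise"] * 4 + ["conflict"] + ["ise"] * 2
    print("coherent DMA transfers:   ", count_transfers(trace, speculative=False))
    print("speculative DMA transfers:", count_transfers(trace, speculative=True))
```

In this model the benefit of speculation grows with the number of iterations that run between conflicts.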

Page 26: Architecture and Design Automation for Application-Specific Processors

Virtual Ways

Page 27: Architecture and Design Automation for Application-Specific Processors

AVS vs. Traditional Cache

Page 28: Architecture and Design Automation for Application-Specific Processors

AVS and Cache Ways are Similar if AVS Memory has 1-input, 1-output

Page 29: Architecture and Design Automation for Application-Specific Processors

Way Stealing

Page 30: Architecture and Design Automation for Application-Specific Processors

Way Stealing

No AVS memories (reduced area)
No coherence protocol

Page 31: Architecture and Design Automation for Application-Specific Processors

Coherent AVS Summary

• Speculative DMA
  – Requires a coherence protocol
    • Lots of bus traffic
    • Good solution for coherent multiprocessor systems
  – No limit on AVS memory organization
  – Uses standard cache IPs
• Virtual Ways
  – Requires non-traditional cache controller
  – No limit on AVS memory organization
• Way Stealing
  – Requires non-traditional cache
  – Number of ways limits number of AVS memories
  – All AVS memories have 1-input, 1-output
  – Keeps AVS memories within the cache

Page 32: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
• Compilation and Synthesis
  – ISE Identification Algorithms
• Conclusion

Page 33: Architecture and Design Automation for Application-Specific Processors

SW and HW Costs

Page 34: Architecture and Design Automation for Application-Specific Processors

Convex and Non-Convex Cuts

Page 35: Architecture and Design Automation for Application-Specific Processors

Integrating AVS Memories

Page 36: Architecture and Design Automation for Application-Specific Processors

Integrating AVS Memories

Page 37: Architecture and Design Automation for Application-Specific Processors

Single Cycle ISE Identification Problem

• Legality Constraints:
  – Convex cut
  – Contains no forbidden nodes
  – Number of inputs/outputs matches architectural constraints
    • (e.g., 2 RF inputs, 1 RF output)
• Objective:
  – Find the legal cut that maximizes speedup
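The legality constraints can be sketched as a check over a data-flow DAG. The representation below (a dict from node to successor list, a cut as a set of nodes) and all names are illustrative assumptions, as is the example graph, where a load node is treated as forbidden. A cut is convex if no path leaves the cut and then re-enters it.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # node -> successors in the data-flow DAG


def is_convex(g: Graph, cut: Set[str]) -> bool:
    """True if no path that leaves the cut ever re-enters it."""
    stack = [s for n in cut for s in g.get(n, []) if s not in cut]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for succ in g.get(node, []):
            if succ in cut:
                return False
            stack.append(succ)
    return True


def count_io(g: Graph, cut: Set[str]) -> Tuple[int, int]:
    """(inputs, outputs): outside producers feeding the cut, and cut nodes used outside."""
    inputs = {p for p, succs in g.items()
              if p not in cut and any(s in cut for s in succs)}
    outputs = {n for n in cut if any(s not in cut for s in g.get(n, []))}
    return len(inputs), len(outputs)


def is_legal(g: Graph, cut: Set[str], forbidden: Set[str],
             max_in: int = 2, max_out: int = 1) -> bool:
    if not cut or cut & forbidden or not is_convex(g, cut):
        return False
    n_in, n_out = count_io(g, cut)
    return n_in <= max_in and n_out <= max_out


# Hypothetical example: mul = a * b; ld = load(mul); add = mul + ld
EXAMPLE_DFG: Graph = {"a": ["mul"], "b": ["mul"], "mul": ["ld", "add"],
                      "ld": ["add"], "add": ["out"], "out": []}

if __name__ == "__main__":
    print(is_legal(EXAMPLE_DFG, {"mul"}, {"ld"}))         # legal single-node cut
    print(is_convex(EXAMPLE_DFG, {"mul", "add"}))         # path mul -> ld -> add re-enters
    print(is_legal(EXAMPLE_DFG, {"mul", "ld", "add"}, {"ld"}))  # contains forbidden node
```

The default limits of 2 inputs and 1 output mirror the register-file example given in the constraint list.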

Page 38: Architecture and Design Automation for Application-Specific Processors

Algorithms for ISE Identification

• Optimal (Exponential worst-case runtime)
  – Branch-and-bound search
  – Integer Linear Program Formulation
• Iterative Improvement
  – Evolutionary algorithms
  – Simulated annealing
• Polynomial-time Heuristics
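As a baseline for the optimal category, the sketch below enumerates every subset of candidate nodes and keeps the legal cut with the largest cycle saving. It is plain exhaustive search, not the branch-and-bound or ILP formulations from the literature, and it rests on illustrative assumptions: the graph format, the per-node software latencies, and a merit function that charges one cycle for the whole ISE.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Graph = Dict[str, List[str]]  # node -> successors; every node appears as a key


def legal(g: Graph, cut: FrozenSet[str], forbidden: Set[str],
          max_in: int, max_out: int) -> bool:
    if cut & forbidden:
        return False
    # Convexity: a path that leaves the cut must never re-enter it.
    stack = [s for n in cut for s in g[n] if s not in cut]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node in cut:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(g[node])
    n_in = sum(1 for p in g if p not in cut and any(s in cut for s in g[p]))
    n_out = sum(1 for n in cut if any(s not in cut for s in g[n]))
    return n_in <= max_in and n_out <= max_out


def best_cut(g: Graph, sw_cycles: Dict[str, int], forbidden: Set[str],
             max_in: int = 2, max_out: int = 1) -> Tuple[FrozenSet[str], int]:
    """Exhaustive search: exponential in the number of candidate nodes."""
    candidates = [n for n in g if n in sw_cycles and n not in forbidden]
    best, best_saving = frozenset(), 0
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            cut = frozenset(combo)
            saving = sum(sw_cycles[n] for n in cut) - 1  # assumed: ISE takes 1 cycle
            if saving > best_saving and legal(g, cut, forbidden, max_in, max_out):
                best, best_saving = cut, saving
    return best, best_saving


# Hypothetical example: shr((a * b) + c), with assumed software latencies.
G: Graph = {"a": ["mul"], "b": ["mul"], "c": ["add"], "mul": ["add"],
            "add": ["shr"], "shr": ["out"], "out": []}
SW = {"mul": 2, "add": 1, "shr": 1}

if __name__ == "__main__":
    print(best_cut(G, SW, set(), max_in=2, max_out=1))
    print(best_cut(G, SW, set(), max_in=3, max_out=1))
```

With 2 inputs the search settles for a single multiply (saving 1 cycle); allowing 3 inputs lets the whole expression become one ISE (saving 3 cycles), which shows how the I/O constraint limits the achievable speedup.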

Page 39: Architecture and Design Automation for Application-Specific Processors

Branch-and-Bound Search Example

[Atasu et al., DAC 2003]

Page 40: Architecture and Design Automation for Application-Specific Processors

Branch-and-Bound Search Example

[Atasu et al., DAC 2003]

Page 41: Architecture and Design Automation for Application-Specific Processors

ISE Identification Algorithms

Conferences

[Kastner et al., ICCAD 2001] [Brisk et al., CASES 2002] [Sun et al., ICCAD 2002] [Lee et al., ICCAD 2002] [Atasu et al., DAC 2003] [Goodwin and Petkov, DATE 2003] [Peymandoust et al., ASAP 2003] [Clark et al., MICRO 2003] [Sun et al., ICCAD 2003] [Lee et al., ISLPED 2003] [Cong et al., FPGA 2004] [Biswas et al., DAC 2004] [Yu and Mitra, DAC 2004] [Borin et al., ESTIMedia 2004] [Kastens et al., LCTES 2004] [Yu and Mitra, CASES 2004] [Pozzi and Ienne, CASES 2005] [Biswas et al., DAC 2005] [Atasu et al., CODES-ISSS 2005] [Sun et al., VLSI Design 2005] [Biswas et al., DATE 2006] [Galuzzi et al., CODES-ISSS 2006] [Sun et al., VLSI Design 2006] [Wong et al., HiPEAC 2007] [Verma et al., CASES 2007] [Pothineni et al., CDES 2007] [Pothineni et al., VLSI Design 2007] [Bonzini and Pozzi, DATE 2007] [Atasu et al., DATE 2007] [Noori et al., DATE 2007] [Galuzzi et al., SAMOS 2007] [Galuzzi et al., ARC 2007] [Bonzini and Pozzi, ASAP 2007] [Wolinski and Kuchcinski, ASAP 2007] [Galuzzi et al., ARC 2007] [Yu and Mitra, FPL 2007] [Bennet et al., LCTES 2007] [Verma et al., ASPDAC 2008] [Wolinski and Kuchcinski, DATE 2008] [Galuzzi and Bertels, ARC 2008] [Atasu et al., ASAP 2008] [Galuzzi and Bertels, ReConFig 2008] [Pothineni et al., VLSI Design 2008] [Galuzzi et al., DATE 2009] [Martin et al., ASAP 2009] [Martin et al., SAMOS 2009] [Kamal et al., ASAP 2010] [Pothineni et al., VLSI Design 2010] [Ahn et al., ASPDAC 2011] [Xiao and Casseau, GLS-VLSI 2011] [Xiao and Casseau, ASAP 2011] [Ahn et al., CODES-ISSS 2011]

Journals

[Atasu et al., IJPP 2003] [Clark et al., IJPP 2003] [Sun et al., TCAD 2004] [Clark et al., TCOMP 2005] [Pozzi et al., TCAD 2006] [Biswas et al., TVLSI 2006] [Sun et al., TCAD 2006] [Sun et al., TVLSI 2006] [Biswas et al., TCAD 2007] [Chen et al., TCAD 2007] [Sun et al., TCAD 2007] [Lee et al., TODAES 2007] [Bonzini and Pozzi, TVLSI 2008] [Zhao et al., IEICE Trans. Fund. 2008] [Atasu et al., TCAD 2008] [Murray et al., TECS 2009] [Verma et al., TCAD 2010] [Galuzzi and Bertels, TRETS 2011]

Page 42: Architecture and Design Automation for Application-Specific Processors

Overview

• Architecture
• Compilation and Synthesis
• Conclusion
  – Summary and Future Research Directions

Page 43: Architecture and Design Automation for Application-Specific Processors

Conclusion

• ASIP Architecture
  – Supplying data bandwidth to the ASCU
  – Ensuring coherence when using local memories
• ISE Identification
  – Problem formulation is well-understood
  – Extensions needed to support memory operations
  – Many effective algorithms exist

Page 44: Architecture and Design Automation for Application-Specific Processors

Future ASIP Research Directions

• Parallel and Multi-core ASIPs
  – Balance ISE speedup across many threads
  – ISE identification for parallel models of computation
    • Concurrent state machines
    • Synchronous Data Flow / Kahn Process Networks
    • MapReduce
• Identify ASCU for Current AND Future Applications
  – Some ISEs are not known at design time
  – Must insert generality or programmability into the ASCU
• Application-specific GPUs
  – Identify vectorized and threaded ISEs
  – ASCU used by hundreds of near-identical threads concurrently