EECC722 - Shaaban, lec # 9, Fall 2006, 10-23-2006
Computing System Element Choices
Figure: computing element choices placed between two opposing axes, Programmability/Flexibility versus Specialization, Development cost/time, and Performance/Chip Area/Watt (Computational Efficiency). Elements shown:
• General Purpose Processors (GPPs): Superscalar, VLIW
• Application Specific Processors: DSPs, Network Processors, Graphics Processors, ...
• Re-configurable Hardware
• ASICs
• Co-Processors
Reconfigurable Computing: also known as Custom Computing Machines (CCMs). Utilize hardware devices customized to match computation, using FPGAs (fine grain) or micro-coded arrays of simple processors (coarse grain).
Computing Element Choices Observation
• Generality and efficiency are in some sense inversely related to one another:
– The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient it will be in performing any one of those tasks.
– Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and make compromises on other, less important features.
• To handle computationally intense problems for which general-purpose machines cannot achieve the necessary performance:
– Special-purpose processors, attached processors, and coprocessors have been built for many years, especially in areas such as image or signal processing (where many of the computational tasks can be very well defined).
– The problem with such machines is that they are special-purpose; as problems change or new ideas and techniques develop, their lack of flexibility makes them problematic as long-term solutions.
• Reconfigurable computing or Custom Computing Machines (CCMs) using FPGAs (Field Programmable Gate Arrays, first introduced in 1986 by Xilinx) or other reconfigurable (customizable) hardware can offer an attractive alternative to other computing element choices.
FPGAs were originally developed for hardware design verification, rapid prototyping, and potential ASIC replacement.
What is Reconfigurable Computing?
• Utilize reconfigurable hardware devices (spatially-programmed connections of hardware processing elements) tailored to the application:
• Customizing hardware to match computations needed/present in a particular application by changing hardware functionality on the fly.
• Reconfigurable Computing Goal: Using reconfigurable hardware devices to build systems with advantages over conventional computing solutions in terms of:
- Flexibility - Performance - Power - Time-to-market - Life cycle cost
“Hardware” customized to specifics of problem.
Direct map of problem-specific dataflow, control.
Circuits “adapted” as problem requirements change.
Computational Efficiency
Hardware customization/reconfigurability, how? Change both the function of hardware cells (elements) and their connectivity to match the requirements of the computation/application.
Still spatial computing, but both the functionality and the connectivity of hardware elements are not fixed.
Conventional Programmable Processors:
• Moderately wide datapaths, which have been growing wider over time (e.g. 16, 32, 64, 128 bits).
• Support for large on-chip instruction caches, which have also been growing larger over time and can now hold thousands of instructions.
• High-bandwidth instruction distribution so that several instructions may be issued per cycle, at the cost of dedicating considerable die area to instruction fetch/distribution/issue/scheduling.
• A single thread of computation control per processor core. (SMT changes this)
Configurable devices (such as FPGAs):
• Narrow datapath (e.g. almost always one bit).
• On-chip space for only one instruction per compute element, i.e. the single instruction which tells the FPGA array cell (Configurable Logic Block, CLB) what function to perform and how to route its inputs and outputs (connectivity to other cells).
• Minimal die area dedicated to instruction distribution, such that it takes hundreds of thousands of compute cycles to change the active set of array instructions (e.g. from one FPGA configuration to another).
• Can handle regular and bit-level computations more efficiently than processors.
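The "one instruction per compute element" idea can be sketched as follows. This is a minimal illustration, assuming a cell built around a 4-input lookup table (LUT); the names `Cell`, `lut_bits`, `input_sources`, and `truth_table` are hypothetical and do not describe any vendor's architecture. The cell's whole "instruction" is a truth table (its function) plus a choice of which wires feed its inputs (its connectivity).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Cell:
    """One configurable cell: a 4-input LUT plus input routing (hypothetical model)."""
    lut_bits: int                             # 16-bit truth table: bit i is the output for input pattern i
    input_sources: Tuple[str, str, str, str]  # names of the wires routed to inputs 0..3

    def evaluate(self, wires: Dict[str, int]) -> int:
        # Build the 4-bit LUT address from the routed wires (input 0 is the LSB).
        index = 0
        for position, name in enumerate(self.input_sources):
            index |= (wires[name] & 1) << position
        return (self.lut_bits >> index) & 1


def truth_table(func) -> int:
    """Pack a 4-input boolean function into a 16-bit LUT configuration."""
    bits = 0
    for index in range(16):
        inputs = [(index >> position) & 1 for position in range(4)]
        bits |= (func(*inputs) & 1) << index
    return bits


# Same cell hardware, two different "instructions" (configurations).
and_cell = Cell(truth_table(lambda a, b, c, d: a & b), ("x", "y", "zero", "zero"))
xor_cell = Cell(truth_table(lambda a, b, c, d: a ^ b), ("x", "y", "zero", "zero"))

wires = {"x": 1, "y": 0, "zero": 0}
print(and_cell.evaluate(wires), xor_cell.evaluate(wires))
```

Changing what the cell computes means rewriting `lut_bits` and `input_sources`, which is why switching the active set of "instructions" amounts to reloading a configuration rather than fetching from an instruction cache.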
Benefits of Reconfigurable Logic Devices
• Non-permanent customization and application development after fabrication
– “Late Binding”
• Economies of scale (amortize large, fixed design costs)
• Shorter time-to-market than ASICs (dealing with evolving requirements and standards, new ideas)
Potential Disadvantages:
• Efficiency penalty (area, performance, power) compared to ASICs.
• Need for correctness verification.
(common to all hardware-based solutions)
Customization is achieved by changing both the function of hardware elements and their connectivity to match the requirements of the application.
Technology Trends Driving Configurable Computing
• Increasing gap between "peak" performance of general-purpose processors and "average actually achieved" performance.
– Most programmers don't write code that gets anywhere near the peak performance of current superscalar CPUs.
• Improvements in FPGA hardware: capacity and speed:
– FPGAs use standard SRAM processes and "ride the commodity technology" curve (e.g. VLSI technology)
– Volume pricing even though customized solution
• Improvements in synthesis and FPGA mapping/routing software
• Increasing number of transistors on a (processor) chip (one billion+): How to use them efficiently?
– Bigger caches (Most popular)?
– Multiple processor cores? (Chip Multiprocessors - CMPs)
– SMT support?
– IRAM-style vector/memory?
– DSP cores or other application specific processors?
– Reconfigurable logic (FPGA or other reconfigurable logic)?
A combination of the above choices? Heterogeneous Computing System on a Chip?
Configurable Computing Architectures
• Configurable computing architectures combine elements of general-purpose computing and application-specific integrated circuits (ASICs).
– The general-purpose processor operates with fixed circuits that perform multiple tasks under the control of software.
– An ASIC contains circuits specialized to a particular task and thus needs little or no software to instruct it.
• The configurable computer can execute software commands that alter its configurable devices (e.g. FPGA circuits) as needed to perform a variety of jobs, i.e. to change both the functionality and the connectivity of hardware elements (cells).
Levels of the Reconfigurable Computational Elements (according to grain size)
• Reconfigurable Logic: bit-level operations, e.g. encoding
• Reconfigurable Datapaths: dedicated data paths, e.g. filters, AGU
• Reconfigurable Arithmetic: arithmetic kernels, e.g. convolution
• Reconfigurable Control: configurable processors; real-time operating systems (RTOS): process management
(Figure: block diagrams of each level. Labels include CLB arrays; adder, buffer, reg0, reg1 and mux; MAC, address generators and memories; data memory, program memory, instruction decoder & controller, and datapath.)
– FPGA chips (fine-grain reconfigurable hardware), or
– Micro-coded arrays of simple processors (coarse-grain reconfigurable hardware).
• A controller FPGA may load circuit configurations stored in memory onto the processor FPGA in response to the requests of the operating program.
• If the memory does not contain a requested circuit, the processor FPGA sends a request to the PC host, which then loads the configuration for the desired circuit.
• Common Hybrid Configurable Architecture Today:
– One or more FPGAs on board connected to host via I/O bus (e.g. PCI)
• Possible Future Hybrid Configurable Architecture:
– Integrate a region of configurable hardware (FPGA or something else) onto the processor chip itself as reconfigurable functional units or coprocessors
– Integrate configurable hardware onto DRAM chip => flexible computing without memory bottleneck
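The controller/host interaction described above (load a configuration from local memory, fall back to the PC host on a miss) can be sketched as below. This is a simplified model under stated assumptions: `ControllerFPGA`, `host_library`, and the `bitstream:` payloads are hypothetical stand-ins, and the request path is collapsed into a single callback rather than modelling separate controller and processor FPGAs.

```python
from typing import Callable, Dict, List


class ControllerFPGA:
    """Hypothetical controller that loads circuit configurations onto a processor FPGA."""

    def __init__(self, fetch_from_host: Callable[[str], bytes]) -> None:
        self.memory: Dict[str, bytes] = {}      # configurations held in local memory
        self.fetch_from_host = fetch_from_host  # fallback path to the PC host
        self.active: str = ""                   # name of the circuit currently loaded
        self.host_requests: List[str] = []      # log of requests that went to the host

    def load(self, circuit: str) -> bytes:
        if circuit not in self.memory:
            # Local memory miss: ask the host for the configuration and keep a copy.
            self.host_requests.append(circuit)
            self.memory[circuit] = self.fetch_from_host(circuit)
        self.active = circuit
        return self.memory[circuit]


def host_library(circuit: str) -> bytes:
    # Stand-in for the host's configuration store; real bitstreams are device-specific.
    return f"bitstream:{circuit}".encode()


controller = ControllerFPGA(host_library)
controller.load("fir_filter")   # miss, fetched from the host
controller.load("fft")          # miss, fetched from the host
controller.load("fir_filter")   # hit, served from local memory
print(controller.host_requests)
```

Only the first request for each circuit reaches the host; repeated requests are served from the local memory.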
Current Hybrid-Architecture on a chip:
Hybrid FPGAs: Integrate one or more hard-wired GPPs with an FPGA on the same chip.
Example: Xilinx Virtex-II Pro, Virtex-4 FX (FPGA with one or two PowerPC cores)
Prototype Video Communications System
• Uses a single FPGA to perform four functions that typically require separate chips.
• A memory chip stores the four circuit configurations and loads them sequentially into the FPGA.
• Initially, the FPGA's circuits are configured to acquire digitized video data.
• The chip is then rapidly reconfigured to transform the video information into a compressed form and reconfigured again to prepare it for transmission.
• Finally, the FPGA circuits are reconfigured to modulate and transmit the video information.
• At the receiver, the four configurations are applied in reverse order to demodulate the data, uncompress the image and then send it to a digital-to-analog converter so it can be displayed on a television screen.
Programmable Circuitry: FPGAs
• Field-Programmable Gate Array (FPGA) introduced by Xilinx (1986).
• Original target applications: hardware design verification, rapid prototyping, and potential ASIC replacement.
• Programmable circuits can be created or removed by sending signals to gates in the logic elements (configuration bit stream).
• A built-in grid of circuits arranged in columns and rows allows the designer to connect a logic element to other logic elements or to an external memory or microprocessor.
• The logic elements are grouped in Configurable Logic Blocks (CLBs) that perform basic binary operations such as AND, OR and NOT.
• Firms, including Xilinx and Altera, have developed devices with the capability of 4,000,000 or more equivalent gates.
• Recently, in addition to “general-purpose” or generic FPGAs, more specialized FPGA families targeting specific areas such as DSP applications have been developed with hard-wired functional units (e.g. MAC units).
Field Programmable Gate Arrays (FPGAs)
• Chip contains many small building blocks that can be configured to implement different functions.
– These building blocks are known as CLBs (Configurable Logic Blocks)
• FPGAs are typically "programmed" by having them read in a stream of configuration information from off-chip
– Typically in-circuit programmable (as opposed to EPLDs, Electrically Programmable Logic Devices, which are typically programmed by removing them from the circuit and using a PROM programmer)
• 25% of an FPGA's gates are application-usable
– The rest control the configurability, interconnects, etc.
• As much as 10X clock rate degradation compared to fully custom hardware implementations (ASICs)
• Typically built using SRAM fabrication technology.
• Since FPGAs "act" like SRAM or logic, they lose their program when they lose power.
• Configuration bits need to be reloaded on power-up.
• Usually reloaded from a PROM, or downloaded from memory via an I/O bus.
Customization is achieved by changing both the function of hardware elements (CLBs here) and their connectivity to match the requirements of the application.
• (1) Hardware Design Specification: A hardware design to realize the selected hardware-bound, computationally intensive portion of the application is specified using RTL/HDL/logic diagrams.
• Synthesis & Layout: Vendor-supplied device-specific software tools are used to convert the hardware design to netlist format.
– (2) Partition the design into logic blocks (CLBs): LUT mapping
– Then find a good (3) placement for each block and (4) routing between them
• Then the serial configuration bitstream is generated (5) and fed down to the FPGAs themselves
– The configuration bits are loaded into a "long shift register" on the FPGA.
– The output lines from this shift register are control wires that control the behavior of all the CLBs on the chip.
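The last step above, loading a serial bitstream into a long shift register whose outputs act as control wires, can be sketched as follows. This is a toy model: the register length, the four configuration bits per cell, and the helper names `shift_in` and `control_words` are assumptions for illustration, not the format of any real device.

```python
from typing import List


def shift_in(bitstream: List[int], length: int) -> List[int]:
    """Serially shift configuration bits into a register of the given length."""
    register = [0] * length
    for bit in bitstream:
        # Each clock: every bit moves one position down, the new bit enters at position 0.
        register = [bit & 1] + register[:-1]
    return register


def control_words(register: List[int], bits_per_cell: int) -> List[List[int]]:
    """Split the register outputs into the control wires for each cell."""
    return [register[i:i + bits_per_cell] for i in range(0, len(register), bits_per_cell)]


# Two hypothetical cells with 4 configuration bits each.
stream = [1, 0, 0, 0, 0, 1, 1, 0]
register = shift_in(stream, length=8)
print(control_words(register, bits_per_cell=4))  # [[0, 1, 1, 0], [0, 0, 0, 1]]
```

Because the first bit shifted in ends up farthest along the register, the bitstream order is the reverse of the final register contents in this model.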
Result of Hardware-Software Partitioning (co-design)
Overall Configurable Hardware Approach
• Select critical portions or phases of an application where hardware customizations will offer an advantage: e.g. computationally intensive portion “kernel(s)” of application.
• Map those application phases to FPGA hardware:
– Hand hardware design/RTL/VHDL
– VHDL => synthesis & layout
• If it doesn't fit in FPGA, re-select application phase (smaller) and try again.
• Perform timing analysis to determine rate at which configurable design can be clocked.
• Write interface software for communication between main processor (GPP) and configurable hardware:
– Determine where input / output data communicated between software and configurable hardware will be stored
– Write code to manage its transfer (like a procedure call interface in standard
• This process turns applications programmers into:
– Part-time hardware designers.
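When selecting which kernel to move into hardware, a rough Amdahl's-law-style estimate can help judge whether the effort pays off. The sketch below is not from the slides: the function name `estimated_speedup`, its parameters, and the simple additive transfer-cost model are assumptions made for illustration.

```python
def estimated_speedup(kernel_fraction: float, hw_speedup: float,
                      transfer_fraction: float = 0.0) -> float:
    """Overall speedup when only the kernel moves to configurable hardware.

    kernel_fraction:   share of original run time spent in the kernel (0..1)
    hw_speedup:        how much faster the kernel runs in hardware
    transfer_fraction: added GPP<->hardware data transfer time, as a share
                       of the original run time (an assumed cost model)
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if hw_speedup <= 0.0:
        raise ValueError("hw_speedup must be positive")
    new_time = (1.0 - kernel_fraction) + kernel_fraction / hw_speedup + transfer_fraction
    return 1.0 / new_time


print(round(estimated_speedup(0.9, 10.0), 2))                          # 5.26
print(round(estimated_speedup(0.9, 10.0, transfer_fraction=0.05), 2))  # 4.17
```

Even a 10X faster kernel that covers 90% of the run time yields only about 5X overall, and data transfer between the GPP and the configurable hardware reduces that further, which is one reason the interface software step matters.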
• Performance analysis problems => what should we put in hardware?
• Hardware-Software Co-design problem
• Choice and granularity of computational elements.
• Choice and granularity of interconnect network.
• Synthesis problems
• Testing/reliability problems.
– Time to load configuration bitstream: may take seconds (improving)
• Reconfiguration latency hiding techniques.
• I/O bandwidth limitations: Need for tight coupling.
• Speed, power, cost, density (improving)
• High-level language support (improving)
• Performance, space estimators
• Design verification
• Partitioning and mapping across several FPGAs
• Partial reconfiguration
• Configuration caching. Supported in some recent