Stream Processors: Programmability with Efficiency Presented by Satyam Dhar 4NI08EC056
Aug 27, 2014
Efficiency refers to the power efficiency of a chip in performing a given task, operation, or calculation. It is measured in GOPS/W (giga-operations per second per watt).
Programmability is the capability of hardware and software to change: to accept a new set of instructions that alters its behavior.
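As a rough illustration of the GOPS/W metric, the sketch below computes it from an operation count, an elapsed time, and an average power draw. The function name `gops_per_watt` and any figures passed to it are assumptions for illustration, not measurements of a real chip.

```c
/* Hypothetical helper (not from the slides): efficiency in
 * giga-operations per second per watt. */
double gops_per_watt(double operations, double seconds, double watts)
{
    double gops = operations / seconds / 1e9;  /* giga-operations per second */
    return gops / watts;                       /* normalize by power draw */
}
```

For example, a chip that completes 2e9 operations in one second while drawing 4 W would score 0.5 GOPS/W under this definition.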
At the system level, a choice must be made between flexibility and power efficiency.
Specialized architectures (e.g. ASICs) perform better (speed, power consumption) but offer no flexibility.
DSPs and microprocessors are highly flexible but do not provide the high efficiency the application needs.
Hence, a trade-off must be made between efficiency and programmability.
Stream processing is a computer programming paradigm related to SIMD (single instruction, multiple data).
It allows some applications to more easily exploit a limited form of parallel processing.
The basic idea is that a single instruction acts on multiple data, i.e. a stream of data.
The Main Idea:
[Animation: data elements from Streams 1 through 4 flow into a programmable kernel, which emits transformed data.]
Streams:
Streams are sets of data elements.
All elements are of a single data type.
Stream elements can be simple, such as a single number, or complex, such as the coordinates of a triangle in 3D space.
Kernels:
Kernels are pieces of code that operate on streams.
They take a stream as input and produce a stream as output.
Kernels can be chained together.
Kernels can have one or more input and output streams.
Kernels can perform complex calculations.
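The kernel properties listed above can be sketched in plain C. This is only an illustration: the kernel names, the use of `float` arrays as streams, and the caller-supplied intermediate buffer are all assumptions, not part of any stream-processor toolchain.

```c
#include <stddef.h>

/* Hypothetical kernel: one input stream, one output stream. */
void kernel_scale(const float *in, float *out, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* Hypothetical kernel: two input streams, one output stream. */
void kernel_add(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Chains the two kernels: out = (a * factor) + b.
 * tmp holds the intermediate stream passed from one kernel to the next. */
void scale_then_add(const float *a, const float *b, float *tmp,
                    float *out, size_t n, float factor)
{
    kernel_scale(a, tmp, n, factor);
    kernel_add(tmp, b, out, n);
}
```

The intermediate stream `tmp` is produced by one kernel and consumed immediately by the next, which is the pattern a stream processor tries to keep in on-chip storage.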
Conventional, sequential paradigm:
    for (int i = 0; i < 100 * 4; i++)
        result[i] = source0[i] + source1[i];
Parallel SIMD paradigm:
    for (int el = 0; el < 100; el++)   // for each vector
        vector_sum(result[el], source0[el], source1[el]);
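The slide does not define `vector_sum`. A minimal sketch of what it might look like in portable C follows, assuming a hypothetical 4-wide `vec4` type and pointer arguments (so the call takes `&result[el]` rather than `result[el]`). On real SIMD hardware the inner four-lane loop would map to a single instruction.

```c
#include <stddef.h>

/* Hypothetical 4-wide vector: one "el" of the SIMD loop. */
typedef struct { float lane[4]; } vec4;

/* Adds all four lanes; stands in for a single SIMD add instruction. */
void vector_sum(vec4 *result, const vec4 *a, const vec4 *b)
{
    for (int k = 0; k < 4; k++)
        result->lane[k] = a->lane[k] + b->lane[k];
}

/* Sums n vectors, i.e. 4 * n scalar elements in total. */
void sum_vectors(vec4 *result, const vec4 *source0,
                 const vec4 *source1, size_t n)
{
    for (size_t el = 0; el < n; el++)   /* for each vector */
        vector_sum(&result[el], &source0[el], &source1[el]);
}
```

With `n = 100`, this covers the same 400 scalar additions as the sequential loop, but in 100 vector steps.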
Types of Parallelism and Locality exhibited:
1. Instruction-Level Parallelism
2. Data-Level Parallelism
3. Producer-Consumer Locality
A stream program expresses a computation as a signal flow graph with streams of records (the edges) flowing between computation kernels (the nodes).
One huge advantage of stream processors: storage structures are partitioned to support many ALUs.
Operands for arithmetic operations reside in local register files (LRFs) near the ALUs.
Streams of data are stored in a stream register file (SRF).
This reduces the on-chip memory required and hence makes the processor highly power efficient.
Hardware Implementation: the Imagine Stream Processor
[Block diagram; successive slides highlight the parts of the chip that perform the following functions.]
Transfer data between parts of the chip.
Local storage and reuse of intermediate streams.
Store kernel code.
Execute one kernel at a time.
Connection with other Imagine chips.
A conventional processor has only a few (typically fewer than four) arithmetic units and is thus unable to exploit much of the parallelism exposed by a stream program.
A conventional processor is also unable to realize much kernel locality because it has too few processor registers (typically fewer than 32, compared with thousands for a stream processor).
Most of the energy consumed by a modern microprocessor or DSP goes to data and instruction movement (only about 1% goes to arithmetic calculations).
A stream processor exploits data and instruction locality to reduce this overhead.
As a result, approximately 30 percent of its energy is consumed by arithmetic operations.
A stream processor time-multiplexes its hardware over the kernels of an application.
All of the clusters work together on one kernel, each operating on different data; then they all proceed to the next kernel, and so on.
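The time-multiplexing described above can be mimicked in sequential C. This is a sketch under stated assumptions: the cluster count `NUM_CLUSTERS`, the two sample kernels, and the interleaved assignment of elements to clusters are all chosen for illustration, and the loop over clusters would run in parallel on actual hardware.

```c
#include <stddef.h>

#define NUM_CLUSTERS 8   /* assumed cluster count for this sketch */

typedef float (*kernel_fn)(float);

/* Two hypothetical per-element kernels. */
float kernel_double(float x) { return 2.0f * x; }
float kernel_inc(float x)    { return x + 1.0f; }

/* All clusters run the same kernel; cluster c handles
 * elements c, c + NUM_CLUSTERS, c + 2 * NUM_CLUSTERS, ... */
void run_kernel(kernel_fn k, const float *in, float *out, size_t n)
{
    for (size_t c = 0; c < NUM_CLUSTERS; c++)        /* parallel in hardware */
        for (size_t i = c; i < n; i += NUM_CLUSTERS)
            out[i] = k(in[i]);
}

/* Kernels execute one after another over the whole stream, in place. */
void run_application(const kernel_fn *kernels, size_t num_kernels,
                     float *stream, size_t n)
{
    for (size_t j = 0; j < num_kernels; j++)
        run_kernel(kernels[j], stream, stream, n);
}
```

Calling `run_application` with `{kernel_double, kernel_inc}` applies `2x + 1` to every element: the first kernel finishes on the whole stream before the second one starts.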
Mapping an application to a stream processor involves two steps:
1. Kernel scheduling, in which the operations of each kernel are scheduled on the ALUs of a cluster.
2. Stream scheduling, in which kernel executions and data transfers are scheduled to use the SRF efficiently and to maximize data locality.
Researchers at Stanford University have developed a set of programming tools that automate both of these tasks so that a stream processor can be programmed entirely in C without sacrificing efficiency.
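One way to picture the goal of stream scheduling is strip-mining: a long stream is processed in strips small enough to fit in on-chip storage, and every kernel runs on a strip before the next strip is loaded. The sketch below is a hand-written illustration of that idea, not the output or interface of the Stanford tools; the strip length `SRF_STRIP`, the local buffer standing in for the SRF, and both kernels are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define SRF_STRIP 4   /* assumed strip length, kept tiny for illustration */

/* Computes out[i] = in[i] * in[i] + 1 one strip at a time, so the
 * intermediate (squared) values never travel back to main memory. */
void process_in_strips(const float *memory_in, float *memory_out, size_t n)
{
    float srf[SRF_STRIP];   /* stand-in for on-chip stream storage */

    for (size_t base = 0; base < n; base += SRF_STRIP) {
        size_t len = (n - base < SRF_STRIP) ? n - base : SRF_STRIP;

        memcpy(srf, memory_in + base, len * sizeof(float));   /* load strip */

        for (size_t i = 0; i < len; i++)   /* kernel 1: square */
            srf[i] = srf[i] * srf[i];
        for (size_t i = 0; i < len; i++)   /* kernel 2: add one */
            srf[i] = srf[i] + 1.0f;

        memcpy(memory_out + base, srf, len * sizeof(float));  /* store strip */
    }
}
```

Each element crosses the memory interface once on the way in and once on the way out, regardless of how many kernels touch it in between.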
The benefits of stream processing are limited to applications where a similar operation is performed on a large data stream. If the work performed on each element is not of the same type, stream processing is inefficient.
Inertia: learning to use the stream programming tools and writing a complex streaming application still represents a significant effort.
Though ASICs have efficiency as good as or better than stream processors, they are costly to design and lack flexibility.
A single stream processor can be reused across many applications with no incremental design cost.
Flexibility also permits new algorithms and functions to be easily implemented.
Due to:
1. competitive energy efficiency,
2. lower recurring costs, and
3. the advantages of flexibility,
we expect stream processors to replace ASICs in the most demanding of signal-processing applications.