Exploit the Integrated Graphics in Packet Processing


Speaker: Francesco Corazza

Supervisor: Prof. Fulvio Risso

Course: Progetto di Reti Locali

Academic year: 2010/2011

Scenario

Packet processing demands ever more performance:
• Increasing network speed
• More intelligence in network devices
• Deeper packet analysis
• …

Intel is the best network hardware choice thanks to:
• Economies of scale
• Price/quality ratio
• Power consumption

We will deal with packet processing on Intel platforms…

Overview

Issues:
• Intel
  • Has not yet deployed efficient tools for our needs
• Discrete GPU
  • Heavy
  • Expensive
  • Not power-saving
  • Affected by the bus bottleneck

Focus:
• Consumer platforms
• CPU + GPU solutions

Two different objectives can be identified…

Presentation Structure

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• What hardware best fits these applications?
• What hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?

Objectives:
• GPU solutions
• CPU+GPU solutions

Chapter Division:
• Focus on the Field
• Focus on Integrated Graphics

FOCUS ON THE FIELD

Focus on the field

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• What hardware best fits these applications?
• What hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?

Packet processing Applications

• Memory intensive
  • Frequent data loads from packets
  • Huge amount of data involved in the processing
• No data locality
  • Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
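
The "small task, unpredictable load" pattern can be sketched in C as per-flow accounting. This is only an illustration: the structure names, the table size and the hash function are assumptions, not taken from the slides, and hash collisions are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096u /* hypothetical table size, must be a power of two */

/* Hypothetical per-flow state: one pair of counters per table slot. */
struct flow_entry {
    uint64_t packets;
    uint64_t bytes;
};

/* Hypothetical packet descriptor with the header fields already parsed. */
struct packet {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t length;
};

/* Small multiplicative hash over the header fields (illustrative only). */
static uint32_t flow_hash(const struct packet *p)
{
    uint32_t h = p->src_ip * 2654435761u;
    h ^= p->dst_ip * 2246822519u;
    h ^= (((uint32_t)p->src_port << 16) | p->dst_port) * 3266489917u;
    h ^= h >> 15;
    return h & (FLOW_TABLE_SIZE - 1u);
}

/* Per-packet task: a few ALU operations, then a load and a store at an
 * address that depends on the packet contents, so consecutive packets
 * usually touch unrelated parts of the table. Returns the slot used. */
static uint32_t account_packet(struct flow_entry *table, const struct packet *p)
{
    assert(table != NULL && p != NULL);
    uint32_t slot = flow_hash(p);
    table[slot].packets += 1;
    table[slot].bytes += p->length;
    return slot;
}
```

Calling account_packet once per received packet reproduces the profile listed above: very little computation per call, and memory accesses whose location cannot be predicted from the previous packet.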



General computing vs. Packet processing

                         General Computing Application     Packet Processing Application
Core activity            CPU bound                         Memory bound
                         ALU-based computation             Load/Store-based computation
Memory access patterns   Locality pattern                  Random pattern
                         Caches are useful                 Unpredictable loads from memory
Structure                Complex tasks launched once       Very repetitive small tasks
                         Small amount of memory required   Huge amount of memory involved

Differences in hardware will mirror differences in software…


Network Processors

• Memory
  • Narrow data buses
  • Multiple data buses
  • Memory hierarchies
  • Few caches
• Superscalar execution
• Massive number of threads
• Thread-level parallelism
• Zero-overhead switching
• Asynchronous code

Packet processing is a niche market, so the industry was obliged to move to solutions borrowed from the mainstream consumer market…




Network Hardware Evolution

Over time, economies of scale have pushed out special-purpose hardware:

• Network Processors
  • CISCO
  • Tilera
  • …
• Consumer Processors
  • GPU solutions
    • Nvidia Fermi
  • CPU+GPU solutions
    • Our investigation lies here
• Hybrid Processors
  • Intel Many Integrated Core
  • AMD Fusion

Focus on the field

• What hardware is most profitable for these applications?
  • GPU
  • CPU + GPU
  • Intel MIC



GPU – Features

• Shared Memory
  • High bandwidth
  • Coalesced access
• Lots of Execution Units
  • Slow cores
  • Massive parallelism
• SIMT execution model
  • More flexible than SIMD
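
Why coalesced access matters can be illustrated with a toy model in C that counts how many memory segments a group of threads touches for a given stride. The 64-byte segment and the 16-thread group are assumed values chosen only for illustration; real GPUs apply their own, more detailed coalescing rules.

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_BYTES 64u /* assumed size of one memory transaction */
#define GROUP_THREADS 16u /* assumed number of threads issuing loads together */

/* Counts the distinct SEGMENT_BYTES-sized segments touched when thread t
 * loads a 4-byte word at byte address base + t * stride_bytes.
 * Fewer segments means fewer memory transactions for the same group. */
static unsigned segments_touched(uint64_t base, uint64_t stride_bytes)
{
    uint64_t seen[GROUP_THREADS];
    unsigned count = 0;

    /* 4-byte alignment guarantees no word straddles two segments. */
    assert(base % 4u == 0u && stride_bytes % 4u == 0u);

    for (unsigned t = 0; t < GROUP_THREADS; t++) {
        uint64_t seg = (base + t * stride_bytes) / SEGMENT_BYTES;
        int found = 0;
        for (unsigned i = 0; i < count; i++) {
            if (seen[i] == seg) {
                found = 1;
                break;
            }
        }
        if (!found)
            seen[count++] = seg;
    }
    return count;
}
```

Under these assumptions, consecutive 4-byte loads (stride 4) from an aligned base fall into a single segment, while a stride of 64 bytes makes every thread hit its own segment. Packet processing, with its unpredictable loads, tends toward the second case.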



CPU + GPU solutions

… just wait a few slides to find out how it will end up

Let's take a look at the architectures that we will face in the future…


Intel MIC (Many Integrated Core)

• Built on the Single-Chip Cloud Computer and Larrabee research projects
• Programming the GPU with the x86 instruction set
  • Development tools in common with Xeon
  • The same tools can compile both for the processor and for the co-processor
• HPC market target
• Knights Corner (first implementation):
  • 50 x86 cores: four threads, 64 KB L1, 256 KB L2 cache, 512-bit vector unit, GDDR5 memory, PCI Express 2.0


Focus on the field

• How can convenient hardware be exploited in these applications?
  • GPGPU
  • DirectCompute
  • OpenCL


GPGPU – Overview

• General-purpose computing on graphics processing units
• Programming GPUs through accessible programming interfaces and industry-standard languages such as C
• Allows software developers to use stream processing on non-graphics data
• Competing interfaces
  • Nvidia Compute Unified Device Architecture (CUDA)
  • AMD Stream (now merged into OpenCL)
  • Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
• Convergence towards standardization (like OpenGL)
  • Khronos Group OpenCL

These frameworks lie just above the hardware…


GPGPU – Layer representation

Layers, from top to bottom:
• Applications: media playback or processing, media UI, recognition, etc.; technical
• Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
• Domain Languages: Accelerator, Brook+, Rapidmind, Ct
• Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
• Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)

GPGPU – Analysis

• CUDA
  • Tight hardware integration
  • Dependence on Nvidia hardware
• OpenCL
  • Gives up lower-level hooks into the architecture
  • Heterogeneous computational resources
  • Integration in the Khronos family (e.g. OpenGL)
• DirectCompute
  • Windows only (Wine/Mono are immature)
  • Integration in the DirectX APIs
  • GPGPU under the hood of Windows 7

Because they are the most widespread, we are going to cover the latter two…


DirectCompute

Exposes the compute functionality of the GPU as a new type of shader (a shader being a program that determines the final appearance of an object's surface)

• Compute Shader
  • Delivers the performance of 3-D games to new applications
• Rendering integration
  • Demonstrates tight integration between computation and rendering
• Supported by all processor vendors
  • DirectX 10, 10.1 and 11 hardware supports Compute Shader 4.0, 4.1 and 5.0 respectively
• Scalable parallel processing model
  • Code should scale for several generations


DirectCompute – Rendering Pipeline

Render scene → Write out scene image → Use Compute for image post-processing → Output final image

DirectCompute – Programming Model

Threads in the same group run concurrently

• Dispatch
  • 3D grid of thread groups
• Thread Group
  • 3D grid of threads
  • numThreads(nX, nY, nZ)
• Thread
  • One invocation of a shader
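
How the three levels relate can be written down as plain index arithmetic. The C sketch below mirrors the HLSL system values SV_GroupID, SV_GroupThreadID, SV_DispatchThreadID and SV_GroupIndex; the struct and function names are hypothetical helpers for illustration and are not part of any DirectX API.

```c
#include <assert.h>

/* Hypothetical stand-in for the HLSL uint3 type. */
struct uint3 {
    unsigned x, y, z;
};

/* SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID,
 * computed component by component. */
static struct uint3 dispatch_thread_id(struct uint3 group_id,
                                       struct uint3 group_thread_id,
                                       struct uint3 num_threads)
{
    struct uint3 id;

    assert(group_thread_id.x < num_threads.x);
    assert(group_thread_id.y < num_threads.y);
    assert(group_thread_id.z < num_threads.z);

    id.x = group_id.x * num_threads.x + group_thread_id.x;
    id.y = group_id.y * num_threads.y + group_thread_id.y;
    id.z = group_id.z * num_threads.z + group_thread_id.z;
    return id;
}

/* SV_GroupIndex flattens SV_GroupThreadID inside its group:
 * z * nX * nY + y * nX + x. */
static unsigned group_index(struct uint3 group_thread_id,
                            struct uint3 num_threads)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}
```

For example, with numThreads(4, 4, 1), thread (1, 2, 0) of group (2, 3, 0) has dispatch thread ID (9, 14, 0) and group index 9.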

DirectCompute – Execution Model

• A thread is executed by a scalar processor
• A thread group is executed on a multiprocessor
• A compute shader kernel is launched as a grid of thread groups (only one grid of thread groups can execute on a device at one time)


DirectCompute – Example HLSL code

struct BufferStruct
{
    uint4 color;
};

// Thread-group size
#define thread_group_size_x 4
#define thread_group_size_y 4

RWStructuredBuffer<BufferStruct> g_OutBuff;

/* This is the number of threads in a thread group, 4x4x1 in this example case */
// e.g.: [numthreads( 4, 4, 1 )]
[numthreads( thread_group_size_x, thread_group_size_y, 1 )]
void main( uint3 threadIDInGroup : SV_GroupThreadID,
           uint3 groupID : SV_GroupID,
           uint groupIndex : SV_GroupIndex,
           uint3 dispatchThreadID : SV_DispatchThreadID )
{
    int N_THREAD_GROUPS_X = 16; // assumed equal to 16 in dispatch(16,16,1)
    int stride = thread_group_size_x * N_THREAD_GROUPS_X;

    // buffer stride, assumes data stride = data width (i.e. no padding)
    int idx = dispatchThreadID.y * stride + dispatchThreadID.x;

    // BufferStruct.color is a uint4, so build the value as uint4 as well
    uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
    g_OutBuff[ idx ].color = color;
}


OpenCL – Overview

Open Computing Language
• Access to heterogeneous computational resources
• Parallel execution on single or multiple processors
  • GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Works with graphics APIs
  • OpenGL
• C99 with extensions
  • Familiar to developers
  • Rich set of built-in functions
  • Easy to develop data- and task-parallel compute programs
  • Defines hardware and numerical precision requirements


OpenCL – Execution Model (I)

• Work-item
  • Basic unit of work on an OpenCL device
• Kernel
  • Basic unit of executable code
  • Similar to a C function
  • Data-parallel or task-parallel
• Program
  • Collection of kernels and functions
  • Analogous to a dynamic library
• Context
  • Environment within which work-items execute
• Applications
  • Queue kernel execution instances
    • In-order: one queue to a device
    • Executed in-order or out-of-order


OpenCL – Coding (I)

• Work-item
  • Smallest execution entity
  • Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, each one executing the same code
  • Unique ID
    • Accessible from the kernel
    • Used to distinguish the data to be processed by each work-item
• Work-group
  • Allows communication and cooperation between work-items
  • Reflects how work-items are organized
    • (N-dimensional grid of work-items, N = 1, 2 or 3)
  • Independent element of execution in the N-D domain
• ND-Range
  • Computation domain (organization level)
  • Specifies how work-groups are organized
    • (N-dimensional grid of work-groups, N = 1, 2 or 3)
  • Defines the total number of work-items that execute in parallel


OpenCL – Coding (II)

OpenCL – Coding (III)

Process a 1024 x 1024 image

Global problem dimensions:
• 1024 x 1024 = 1 kernel execution per pixel
• 1,048,576 total executions
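
The arithmetic behind these numbers can be checked with a few lines of C. The 16 x 16 work-group size used in the usage note is an assumption made for illustration (the slides do not specify one), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total number of work-items in a 2-D ND-Range: one per pixel here. */
static size_t total_work_items_2d(size_t global_x, size_t global_y)
{
    return global_x * global_y;
}

/* Number of work-groups along one dimension. OpenCL 1.x requires the
 * global size to be a multiple of the work-group (local) size. */
static size_t groups_per_dim(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For the 1024 x 1024 image, total_work_items_2d(1024, 1024) gives 1,048,576 work-items. With an assumed 16 x 16 work-group, groups_per_dim(1024, 16) is 64 in each dimension, so the ND-Range consists of 64 x 64 = 4,096 work-groups of 256 work-items each.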


Scalar:

void scalar_mul(int n, const float *a, const float *b, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

Data-parallel:

kernel void dp_mul(global const float *a, global const float *b, global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over “n” work-items
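
What "execute dp_mul over n work-items" means can be emulated in plain C, without an OpenCL runtime. In this sketch the work-item ID is passed explicitly instead of coming from get_global_id(0), and run_dp_mul is a hypothetical stand-in for the runtime; it deliberately walks the IDs in reverse to stress that work-items must not depend on execution order.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar version: a single call walks the whole array. */
static void scalar_mul(size_t n, const float *a, const float *b, float *result)
{
    for (size_t i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

/* Body of the data-parallel kernel for one work-item. In OpenCL C the
 * id would be obtained with get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b, float *result)
{
    result[id] = a[id] * b[id];
}

/* Hypothetical stand-in for the runtime: launches n work-items.
 * They run one after another here, in reverse order; a real device may
 * run them in any order or in parallel, which is safe because each
 * work-item only touches its own element. */
static void run_dp_mul(size_t n, const float *a, const float *b, float *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = n; id-- > 0; )
        dp_mul_item(id, a, b, result);
}
```

Both versions produce the same result array; the difference is that the loop has moved out of the kernel and into the launch, where the device is free to parallelize it.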

FOCUS ON INTEGRATED GRAPHICS

CPU+GPU solutions

The architectures involved are:
• Intel Core 2nd Generation (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion

Let’s compare them…

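CPU+GPU solutions

Architecture                               Market Target                  Release Date
Intel Core 2nd Generation (Sandy Bridge)   Desktop / Hi-End               01/2011
Intel Atom E600 Series (Tunnel Creek)      Mobile / Industrial embedded   11/2010
Nvidia Tegra (Tegra 2)                     Mobile / Tablets               01/2010
AMD Fusion                                 Consumer / Desktop             01/2011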

• Intel Core 2nd Generation (Sandy Bridge)
  • Features
  • Integrated GPU
  • AVX (Advanced Vector Extensions)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion


Sandy Bridge – Features (I)

• CPU die redesigned
  • The chip’s northbridge and GPU are both on-die (in previous versions they were on a physically separate chip)
• LLC (Last Level Cache, formerly L3 cache)
  • Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU
  • In previous generations, each individual core had its own private path to the LLC
• Unified Memory Architecture (UMA)
  • Architecture where the graphics subsystem does not have exclusive dedicated memory and uses the host system’s memory
  • Dynamic Video Memory Technology (DVMT)
• Hyper-Threading


Sandy Bridge – Features (II)

• Turbo Boost Technology 2.0
  • Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget
  • The processor can increase individual core speed or graphics speed as the workload dictates
  • Developers cannot directly control it
• AVX (Advanced Vector Extensions)
  • Extends SIMD instructions from 128 bits to 256 bits
  • AVX enables a single instruction to work on eight single-precision floating-point values at a time instead of the four that SSE provides
  • Increased processor performance with a minimal increase in power (HUGI: Hurry Up and Get Idle)
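
The "eight instead of four" figure follows directly from the register width. A small C sketch of that arithmetic is given below; the function names are hypothetical, and the loop split into full vector iterations plus a scalar tail is a common pattern assumed here for illustration rather than something prescribed by the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bits each that fit in one SIMD register. */
static size_t simd_lanes(size_t register_bits, size_t elem_bits)
{
    assert(elem_bits > 0 && register_bits % elem_bits == 0);
    return register_bits / elem_bits;
}

/* Splits a loop over n elements into full vector iterations plus a
 * scalar tail for the elements that do not fill a whole register. */
static void split_loop(size_t n, size_t lanes,
                       size_t *vector_iters, size_t *tail)
{
    assert(lanes > 0 && vector_iters != NULL && tail != NULL);
    *vector_iters = n / lanes;
    *tail = n % lanes;
}
```

With 32-bit floats, simd_lanes(128, 32) is 4 and simd_lanes(256, 32) is 8. A loop over 100 floats therefore needs 25 vector iterations with 128-bit registers, but only 12 with 256-bit registers plus a tail of 4 elements.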

The next diagram shows the level of integration that Intel has reached…

Sandy Bridge – Block Diagram

Now we zoom in on the graphics processor…


Sandy Bridge – Integrated GPU (I)

Sandy Bridge – Integrated GPU (II)

• DirectCompute support
  • DirectX 10.1
  • The internal ISA maps one-to-one with most DirectX 10 API instructions, resulting in a very CISC-like architecture
• Execution Unit (EU)
  • The pipeline decoder uses only fixed-function logic to limit overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors)
  • Each EU can dual-issue, picking instructions from multiple threads
  • Transcendental math is handled by hardware in the EU, and its performance has been sped up considerably

The GPU’s parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?


AVX – Overview

KEY FEATURES
• Wider vectors
  • Increased from 128 to 256 bits
  • Two 128-bit load ports
• Enhanced data rearrangement
  • Use the new 256-bit primitives to broadcast, mask loads and stores, and permute data
• Three and four operands
  • Non-destructive source for both AVX 128 and AVX 256
• Flexible unaligned memory access support
• Extensible new opcode (VEX)

BENEFITS
• Higher peak FLOPs with good power efficiency
• Organize, access and pull only necessary data more quickly and efficiently
• Fewer register copies, better register use for both vector and scalar code
• More opportunities to fuse load and compute operations
• Code size reduction

Some assembly instructions can show the power of AVX…

AVX – Instructions (I)

AVX – Instructions (II)

AVX – Code Example (I)

High level code:

#include <immintrin.h>

void foo(float *a, float *b, float *r)
{
    __m256 s1, s2, res;

    s1 = _mm256_loadu_ps(a);
    s2 = _mm256_loadu_ps(b);
    res = _mm256_add_ps(s1, s2);
    _mm256_storeu_ps(r, res);
}

Assembly:

; -- Begin _foo
ALIGN 16
PUBLIC _foo
_foo PROC NEAR
; parameter 1: 4 + esp
; parameter 2: 8 + esp
; parameter 3: 12 + esp
$B2$1: ; Preds $B2$0
        mov eax, DWORD PTR [4+esp]
        mov edx, DWORD PTR [8+esp]
        mov ecx, DWORD PTR [12+esp]
        vmovups ymm0, YMMWORD PTR [eax]
        vaddps ymm1, ymm0, YMMWORD PTR [edx]
        vmovups YMMWORD PTR [ecx], ymm1
        ; LOE ebx ebp esi edi
$B2$2: ; Preds $B2$1
        ret ;10.1
        ALIGN 16
        ; LOE
_foo ENDP
;_foo ENDS

AVX – Benchmarks

SIMD processing works best with data-parallel applications where the data is arranged in a structure of arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
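
The difference between the two layouts can be shown in a short C sketch. The point type, the batch size of eight and the function names are assumptions chosen for illustration; eight floats were picked because they fill exactly one 256-bit register.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 8 /* hypothetical batch size: eight floats = 256 bits */

/* Array of structures (AoS): the fields of one point are adjacent, so
 * consecutive x values are sizeof(struct point_aos) bytes apart. */
struct point_aos {
    float x, y, z;
};

/* Structure of arrays (SoA): the same field of consecutive points is
 * adjacent, so all x values form one contiguous run of floats. */
struct points_soa {
    float x[BATCH];
    float y[BATCH];
    float z[BATCH];
};

/* Strided access: each iteration skips over y and z. */
static void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;
}

/* Unit-stride access: the eight x values can be loaded as one vector. */
static void scale_x_soa(struct points_soa *p, float k)
{
    assert(p != NULL);
    for (size_t i = 0; i < BATCH; i++)
        p->x[i] *= k;
}
```

Both functions compute the same values, but only the SoA loop reads contiguous memory, which is the arrangement a vectorizing compiler or hand-written SIMD code can load directly into a wide register.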

Sandy Bridge – Conclusion

• Interesting features for packet processing
  • Integrated memory controller
  • DirectCompute
  • AVX
• CPU+GPU integration is only on the physical layer
  • Packet processing can exploit the CPU or the GPU
• Unpredictable evolution
  • DirectCompute could exploit the CPU
  • AVX could exploit the GPU
• The upcoming Ivy Bridge will support both OpenCL and DirectX 11


• Intel Core 2nd Generation (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
  • Features
  • Block Diagram
  • Customization
• Nvidia Tegra (Tegra 2)
• AMD Fusion


Atom E600 – Features (I)

• SoC (System on Chip)
• Power optimized
  • Fanless performance
• Flexible and open I/O
  • Adapts to application-specific needs
  • PCIe instead of a proprietary FSB
• 7-year long-life support
• Hyper-Threading Technology
  • Two logical processors
• SSE3 (Streaming SIMD Extensions 3)
  • Support for SIMD instructions


Atom E600 – Features (II)

• Power saving
  • Intel SpeedStep Technology
    • Enables the operating system to program a processor to transition to lower frequency and/or voltage levels while executing a workload
  • Deep Power Down Technology
    • Reduces static power consumption by turning off power to the cache and other subsystems in the processor
  • In-order processing
    • Favors power efficiency: the CPU does not reorder the instruction stream to extract instruction-level parallelism
• DirectCompute support
  • None: Tunnel Creek supports only DirectX 9

The next diagram gives an inside view of the Atom architecture…


Atom E600 – Block Diagram

Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…


Atom E600 – Customization

• Open connection
  • Developers can attach the processor to a variety of chipsets
    • Application-specific third-party chipsets
    • FPGAs
    • ASICs
  • The processor can be used without a chipset (limited I/O needs)
    • The processor’s four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers


Atom E600 – Conclusion

• Interesting features for packet processing
  • Power-saving features
  • Long support
  • Flexible architecture
• No support for GPGPU
  • Old-school GPGPU
    • Use OpenGL ES 2.0 shaders (programmable shaders)
    • Rewrite the code as a fragment shader
  • Wait for Cedar Trail (2011, not yet released)
    • DirectX 10.1


• Intel Core 2nd Generation (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
  • Features
  • Block Diagram
• AMD Fusion


Tegra – Features

• SoC (System-on-a-Chip)
  • Dual-core ARM CPU
  • GeForce GPU
• ULP (ultra-low power consumption)
• Graphics support
  • No DirectX support
  • No CUDA support
  • OpenGL ES 2.0 support

The next diagram gives a quantitative view of a Tegra chip…



Tegra – Block Diagram


Tegra – Conclusion

• Interesting features for packet processing
  • Integrated memory controller
  • Low power consumption
• No support for GPGPU
  • Old-school GPGPU
    • Use OpenGL ES 2.0 shaders (programmable shaders)
    • Rewrite the code as a fragment shader
  • Wait for Tegra 3 (third quarter of 2011)
    • DirectX 11
    • CUDA


• Intel Core 2nd Generation (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
  • AMD Vision
  • Features
  • APU Roadmap
  • Integration Highlights


Fusion – AMD Vision

Fusion is a step-forward technology:

AMD has realized this heterogeneous architecture by developing APUs…


Fusion – Features (I)

Video: http://www.youtube.com/watch?v=BihrG7DhhBM&feature=related

Fusion – Features (II)

• DirectCompute support (DirectX 11)
• OpenCL 1.1
• Additive capabilities of an APU and a discrete graphics solution
• Power-oriented benefits
• Massive SIMD GPU (SSE5)
• Programmable scalar and vector processor cores
• APU family
  • Bulldozer (Sandy Bridge’s opponent)
    • Performance and scalability
  • Bobcat (Atom’s opponent)

Let’s compare these two solutions…


Fusion – Features (III)

Bulldozer and Bobcat also differ in their market target…


Fusion – APU roadmap

The high level of integration differentiates APUs from CPUs…


Fusion – Integration Highlights

• Shared memory
  • Lower latencies
• PCI Express
  • Cuts down some latencies
• No discrete GPU means less
  • Cost
  • Power
  • Motherboard complexity
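
The latency argument can be made concrete with a back-of-the-envelope model in C: the time to move a buffer to a discrete GPU is taken as a fixed per-transfer overhead plus size divided by bandwidth. This is only a sketch under that assumed model; the function names are hypothetical, and any numbers plugged in are illustrative parameters, not measurements.

```c
#include <assert.h>

/* Assumed cost model for one transfer over a bus:
 * fixed overhead plus payload size divided by bandwidth. */
static double transfer_time_us(double bytes, double overhead_us,
                               double bandwidth_bytes_per_us)
{
    assert(bytes >= 0.0 && overhead_us >= 0.0);
    assert(bandwidth_bytes_per_us > 0.0);
    return overhead_us + bytes / bandwidth_bytes_per_us;
}

/* Fraction of the transfer time spent on the fixed overhead. */
static double overhead_fraction(double bytes, double overhead_us,
                                double bandwidth_bytes_per_us)
{
    double total = transfer_time_us(bytes, overhead_us,
                                    bandwidth_bytes_per_us);
    return total > 0.0 ? overhead_us / total : 0.0;
}
```

With purely illustrative parameters (10 µs of overhead, 4000 bytes per µs), a 64-byte packet spends over 99% of its transfer time on overhead, while a 4 MB buffer spends under 1%. Small per-packet transfers are exactly the case where shared memory, which avoids the copy altogether, pays off most.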


Fusion – Conclusion

• Interesting features for packet processing
  • OpenCL/DirectCompute/SSE5
  • Tightly integrated architecture
  • New technology (First-Come-First-Served)
• OpenCL
  • Could be the “El Dorado” for packet processing
    • CPU/GPU working in AND/OR configuration
    • Shared memory
    • Embedded implementation of Fusion technology
  • AMD openly supports it to bring the power of heterogeneous computing into the mainstream


CONCLUSIONS

Summary (I)

This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing:

• GPGPU-driven solutions
  • CUDA, OpenCL, DirectX 11
• SIMD-driven solutions
  • Exploit highly parallel operations through SIMD instruction sets
  • AVX, SSE
• Custom hardware solutions
  • Design flexible modules tailored to specific needs
  • FPGA

The former solutions are the most in vogue at the moment…

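Summary (II)

Architecture                               OpenCL   SSE           FPGA   DirectCompute   OpenGL
Intel Core 2nd Generation (Sandy Bridge)   no       yes (AVX)     no     yes             yes
Intel Atom E600 Series (Tunnel Creek)      no       yes (SSE 3)   yes    no              yes
Nvidia Tegra (Tegra 2)                     no       yes (SSE 3)   no     no              yes
AMD Fusion                                 yes      yes (SSE 5)   no     yes             yes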

Recommendations

Writing explicitly parallel code is more efficient than relying on hardware parallelization:


THANK YOU

Questions?

Bibliography

• Lecture notes of course “Tecnologie per reti di calcolatori”
• http://www.intel.com/technology/architecture-silicon/2ndgen/index.htm
• http://www.intel.com/technology/atom/index.htm
• http://www.intel.com/technology/architecture-silicon/mic/index.htm
• http://sites.amd.com/us/fusion/apu/pages/fusion.aspx
• http://www.hwupgrade.it/articoli/cpu/2674/intel-sandy-bridge-analisi-dell-architettura_index.html
• http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/
• http://www.multicorepacketprocessing.com/
• http://www.nvidia.co.uk/object/tegra-2.html
• http://www.tomshardware.com/reviews/sandy-bridge-fusion-nvidia-chipset,2763-6.html
• http://www.tomshardware.com/reviews/amd-fusion-brazos-zacate,2786-2.html
• http://gpgpu.org/
• http://channel9.msdn.com/tags/DirectCompute-Lecture-Series/
• http://gpgpu-computing.blogspot.com/
• http://blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx
• http://www.khronos.org/developers/resources/opencl/#ttutorials
• http://www.youtube.com/watch?v=VIs1CxuUrpc&feature=related
