Page 1: Exploit the Integrated  Graphics in Packet Processing

EXPLOIT THE INTEGRATED GRAPHICS IN PACKET PROCESSING

Speaker: Francesco Corazza
Supervisor: Prof. Fulvio Risso
Course: Progetto di Reti Locali
Academic year: 2010/2011

Page 2: Exploit the Integrated  Graphics in Packet Processing

Scenario

Packet processing is demanding more performance:
• Increasing network speeds
• More intelligence in network devices
• Deeper packet analysis
• …

Intel is the best network hardware choice thanks to:
• Economies of scale
• Price/quality ratio
• Power consumption

We will deal with packet processing on Intel platforms…

Page 3: Exploit the Integrated  Graphics in Packet Processing

Overview

Issues:
• Intel
  • Has not yet deployed efficient tools for our needs
• Discrete GPUs
  • Heavy
  • Expensive
  • Not power-saving
  • Affected by the bus bottleneck

Focus:
• Consumer platforms
• CPU + GPU solutions

Two different objectives can be identified…

Page 4: Exploit the Integrated  Graphics in Packet Processing

Presentation Structure

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?

Objectives:
• GPU solutions
• CPU+GPU solutions

Chapter Division:
• Focus on the Field
• Focus on Integrated Graphics

Page 5: Exploit the Integrated  Graphics in Packet Processing

FOCUS ON THE FIELD

Page 6: Exploit the Integrated  Graphics in Packet Processing

Focus on the field

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?

Page 7: Exploit the Integrated  Graphics in Packet Processing

Packet Processing Applications

• Memory intensive
  • Frequent data loads from packets
  • Huge amount of data involved in the processing
• No data locality
  • Unpredictable loads from different memory areas
• Small tasks, over a large number of packets
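The access pattern listed on this slide can be illustrated with a minimal C sketch. It is not taken from the presentation: the flow-table layout, the choice of the FNV-1a hash and all names (`flow_entry`, `classify`, `TABLE_SIZE`) are assumptions made for illustration. Each packet triggers a very small task whose memory access depends on the packet contents, so consecutive packets tend to touch unrelated memory areas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 4096u   /* hypothetical flow-table size (power of two) */

/* Hypothetical flow-table entry: one counter per hash bucket. */
struct flow_entry {
    uint32_t key;      /* hash of the flow that last used this bucket */
    uint64_t packets;  /* packets counted in this bucket */
};

/* FNV-1a hash over the header bytes used as flow identifier. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Small per-packet task: hash the header, then update one table slot.
 * The slot index depends on the packet contents, so the access is
 * effectively random. Returns the index of the bucket that was touched. */
static uint32_t classify(struct flow_entry *table,
                         const uint8_t *hdr, size_t hdr_len)
{
    assert(table != NULL && hdr != NULL);
    uint32_t h = fnv1a(hdr, hdr_len);
    uint32_t slot = h & (TABLE_SIZE - 1u);
    table[slot].key = h;
    table[slot].packets++;
    return slot;
}
```

Calling `classify` once per received packet reproduces, in miniature, the "small tasks over a large number of packets" workload: very little arithmetic, dominated by one data-dependent memory access.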


Page 9: Exploit the Integrated  Graphics in Packet Processing

General computing vs. Packet processing

Structure
• General computing application: CPU bound, ALU-based computation
• Packet processing application: memory bound, load/store-based computation

Memory access patterns
• General computing application: locality pattern, caches are useful
• Packet processing application: random pattern, unpredictable loads from memory

Core activity
• General computing application: complex tasks launched once, small amount of memory required
• Packet processing application: very repetitive small tasks, huge amount of memory involved

Differences in hardware will mirror differences in software…


Page 11: Exploit the Integrated  Graphics in Packet Processing

Network Processors

• Memory
  • Narrow data buses
  • Multiple data buses
  • Memory hierarchies
  • Few caches
• Superscalar execution
• Massive number of threads
• Thread-level parallelism
• Zero-overhead switching
• Asynchronous code

Packet processing is a market niche, so the industry was obliged to move to solutions borrowed from the mainstream consumer market…


Page 12: Exploit the Integrated  Graphics in Packet Processing

Network Hardware Evolution

Economies of scale have pushed out specialized hardware:

• Network Processors
  • Cisco
  • Tilera
  • …
• Consumer Processors
  • GPU solutions
    • Nvidia Fermi
  • CPU+GPU solutions
    • Our investigation lies here
  • Hybrid Processors
    • Intel Many Integrated Core
    • AMD Fusion

[Diagram: the items above are arranged along a TIME axis]

Page 13: Exploit the Integrated  Graphics in Packet Processing

Focus on the field

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
  • GPU
  • CPU + GPU
  • Intel MIC
• How can convenient hardware be exploited in these applications?

Page 14: Exploit the Integrated  Graphics in Packet Processing

GPU – Features

• Shared memory
  • High bandwidth
  • Coalesced access
• Lots of execution units
  • Slow cores
  • Massive parallelism
• SIMT execution model
  • More flexible than SIMD


Page 15: Exploit the Integrated  Graphics in Packet Processing

CPU + GPU solutions

… just wait a few slides to find out how it ends up

Let's take a look at the architectures that we will face in the future…

Page 16: Exploit the Integrated  Graphics in Packet Processing

Intel MIC (Many Integrated Core)

• Built on the Single-Chip Cloud Computer and Larrabee research projects
• Programming the GPU with the x86 instruction set
  • Development tools in common with Xeon
  • The same tools can compile both for the processor and for the co-processor
• HPC market target
• Knights Corner (first implementation):
  • 50 x86 cores: four threads, 64KB L1, 256KB L2 cache, 512-bit vector unit, GDDR5 memory, PCI Express 2.0

Page 17: Exploit the Integrated  Graphics in Packet Processing

Focus on the field

• What kind of application is packet processing?
• Which features differentiate it from general computing?
• Which hardware best fits these applications?
• Which hardware is most profitable for these applications?
• How can convenient hardware be exploited in these applications?
  • GPGPU
  • DirectCompute
  • OpenCL

Page 18: Exploit the Integrated  Graphics in Packet Processing

GPGPU – Overview

• General-purpose computing on graphics processing units
• Programming GPUs through accessible programming interfaces and industry-standard languages such as C
• Allows software developers to use stream processing on non-graphics data
• Competing interfaces
  • Nvidia Compute Unified Device Architecture (CUDA)
  • AMD Stream (now merged into OpenCL)
  • Microsoft DirectCompute (new subset of the DirectX 10/11 APIs)
• Convergence towards standardization (like OpenGL)
  • Khronos Group OpenCL

These frameworks lie just above the hardware…

Page 19: Exploit the Integrated  Graphics in Packet Processing

GPGPU – Layer representation

Layers, from top to bottom:
• Applications: Media playback or processing, media UI, recognition, etc. Technical
• Domain Languages: Accelerator, Brook+, Rapidmind, Ct
• Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
• Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
• Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)

Page 20: Exploit the Integrated  Graphics in Packet Processing

GPGPU – Analysis

• CUDA
  • Tight hardware integration
  • Dependence on Nvidia hardware
• OpenCL
  • Gives up lower-level hooks into the architecture
  • Heterogeneous computational resources
  • Integration in the Khronos family (e.g. OpenGL)
• DirectCompute
  • Windows only (Wine/Mono are immature)
  • Integration in the DirectX APIs
  • GPGPU under the hood of Windows 7

Because they are the most widespread, we are going to cover the latter two…

Page 21: Exploit the Integrated  Graphics in Packet Processing

DirectCompute

Exposes the compute functionality of the GPU as a new type of shader (a shader is a program that determines the final appearance of an object's surface)

• Compute Shader
  • Delivers the performance of 3-D games to new applications
• Rendering integration
  • Demonstrates tight integration between computation and rendering
• Supported by all processor vendors
  • DirectX 10, 10.1 and 11 class hardware corresponds to Compute Shader 4.0, 4.1 and 5.0 respectively
• Scalable parallel processing model
  • Code should scale for several hardware generations

Page 22: Exploit the Integrated  Graphics in Packet Processing

DirectCompute – Rendering Pipeline

1. Render scene
2. Write out scene image
3. Use Compute for image post-processing
4. Output final image

Page 23: Exploit the Integrated  Graphics in Packet Processing

DirectCompute – Programming Model

• Dispatch
  • 3D grid of thread groups
• Thread group
  • 3D grid of threads
  • numthreads(nX, nY, nZ)
• Thread
  • One invocation of a shader

Threads in the same group run concurrently
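The relationship between the dispatch grid, the thread group and the single thread can be made concrete with a small C sketch that reproduces, on the CPU, how the HLSL system values are derived. The `uint3` type and the function names are assumptions introduced here for illustration; only the two formulas follow the documented definitions of `SV_DispatchThreadID` and `SV_GroupIndex`.

```c
#include <assert.h>

/* Three-component unsigned vector, mirroring HLSL's uint3. */
typedef struct { unsigned x, y, z; } uint3;

/* SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID */
static uint3 dispatch_thread_id(uint3 group_id, uint3 num_threads,
                                uint3 group_thread_id)
{
    assert(group_thread_id.x < num_threads.x &&
           group_thread_id.y < num_threads.y &&
           group_thread_id.z < num_threads.z);
    uint3 id = {
        group_id.x * num_threads.x + group_thread_id.x,
        group_id.y * num_threads.y + group_thread_id.y,
        group_id.z * num_threads.z + group_thread_id.z
    };
    return id;
}

/* SV_GroupIndex: flattened index of a thread inside its own group. */
static unsigned group_index(uint3 num_threads, uint3 group_thread_id)
{
    return group_thread_id.z * num_threads.x * num_threads.y
         + group_thread_id.y * num_threads.x
         + group_thread_id.x;
}
```

For instance, with `numthreads(4, 4, 1)`, the thread (3, 2, 0) of group (2, 1, 0) has dispatch thread ID (11, 6, 0) and group index 11.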

Page 24: Exploit the Integrated  Graphics in Packet Processing

DirectCompute – Execution Model

• A thread is executed by a scalar processor
• A thread group is executed on a multiprocessor
• A compute shader kernel is launched as a grid of thread groups (only one grid of thread groups can execute on a device at one time)

Page 25: Exploit the Integrated  Graphics in Packet Processing

DirectCompute – Example HLSL code

struct BufferStruct
{
    uint4 color;
};

// Thread-group size
#define thread_group_size_x 4
#define thread_group_size_y 4

RWStructuredBuffer<BufferStruct> g_OutBuff;

/* Number of threads in a thread group: 4x4x1 in this example,
   i.e. [numthreads( 4, 4, 1 )] */
[numthreads( thread_group_size_x, thread_group_size_y, 1 )]
void main( uint3 threadIDInGroup  : SV_GroupThreadID,
           uint3 groupID          : SV_GroupID,
           uint  groupIndex       : SV_GroupIndex,
           uint3 dispatchThreadID : SV_DispatchThreadID )
{
    // Assumed equal to 16, as in Dispatch(16, 16, 1)
    int N_THREAD_GROUPS_X = 16;

    // Buffer stride; assumes data stride = data width (i.e. no padding)
    int stride = thread_group_size_x * N_THREAD_GROUPS_X;

    // Flatten the 2D dispatch-thread coordinates into a buffer index
    int idx = dispatchThreadID.y * stride + dispatchThreadID.x;

    // The buffer element is a uint4, so the value is built as a uint4
    uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);

    g_OutBuff[ idx ].color = color;
}

Page 26: Exploit the Integrated  Graphics in Packet Processing

OpenCL – Overview

Open Computing Language
• Access to heterogeneous computational resources
• Parallel execution on single or multiple processors
  • GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and handheld profiles
• Works with graphics APIs
  • OpenGL
• C99 with extensions
  • Familiar to developers
  • Rich set of built-in functions
  • Easy to develop data- and task-parallel compute programs
  • Defines hardware and numerical precision requirements

Page 27: Exploit the Integrated  Graphics in Packet Processing

OpenCL – Execution Model (I)

• Work-item
  • Basic unit of work on an OpenCL device
• Kernel
  • Basic unit of executable code
  • Similar to a C function
  • Data-parallel or task-parallel
• Program
  • Collection of kernels and functions
  • Analogous to a dynamic library
• Context
  • Environment within which work-items execute
• Applications
  • Queue kernel execution instances
    • Queued in-order: one queue to a device
    • Executed in-order or out-of-order

Page 28: Exploit the Integrated  Graphics in Packet Processing

OpenCL – Coding (I)

• Work-item
  • Smallest execution entity
  • Every time a kernel is launched, many work-items (a number specified by the programmer) are launched, each one executing the same code
  • Unique ID
    • Accessible from the kernel
    • Used to distinguish the data to be processed by each work-item
• Work-group
  • Allows communication and cooperation between work-items
  • Reflects the organization of work-items
    • (N-dimensional grid of work-items, N = 1, 2 or 3)
  • Independent element of execution in the N-D domain
• ND-Range
  • Computation domain (organization level)
  • Specifies how work-groups are organized
    • (N-dimensional grid of work-groups, N = 1, 2 or 3)
  • Defines the total number of work-items that execute in parallel

Page 29: Exploit the Integrated  Graphics in Packet Processing

OpenCL – Coding (II)

Page 30: Exploit the Integrated  Graphics in Packet Processing

OpenCL – Coding (III)

Process a 1024 x 1024 image
Global problem dimensions:
• 1024 x 1024 = 1 kernel execution per pixel
• 1,048,576 total executions

Scalar:

void scalar_mul(int n, const float *a, const float *b, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

Data-parallel:

kernel void dp_mul(global const float *a, global const float *b, global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over "n" work-items
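What "execute dp_mul over n work-items" means can be emulated on the host with a short C sketch. This is only a sequential stand-in written for illustration, not the OpenCL runtime: `dp_mul_item` and `run_dp_mul` are hypothetical names, and the work-item ID is passed as a parameter instead of being read through `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the data-parallel kernel, with the work-item ID passed in
 * explicitly instead of being obtained from get_global_id(0). */
static void dp_mul_item(size_t id, const float *a, const float *b,
                        float *result)
{
    result[id] = a[id] * b[id];
}

/* Sequential stand-in for launching the kernel over n work-items.
 * A real device is free to run these invocations concurrently and in
 * any order, which is safe here because each one writes only result[id]. */
static void run_dp_mul(size_t n, const float *a, const float *b,
                       float *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < n; id++)
        dp_mul_item(id, a, b, result);
}
```

The loop that the scalar version contains explicitly moves out of the kernel and into the launch: the kernel body describes one element, and the runtime decides how the n invocations are scheduled.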

Page 31: Exploit the Integrated  Graphics in Packet Processing

FOCUS ON INTEGRATED GRAPHICS

Page 32: Exploit the Integrated  Graphics in Packet Processing

CPU+GPU solutions

The architectures involved are:
• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion

Let's compare them…

Page 33: Exploit the Integrated  Graphics in Packet Processing

CPU+GPU solutions

Architecture                 Market Target                  Release Date
Sandy Bridge                 Desktop / Hi-End               01/2011
Atom E600 (Tunnel Creek)     Mobile / Industrial embedded   11/2010
Tegra 2                      Mobile / Tablets               01/2010
AMD Fusion                   Consumer / Desktop             01/2011

Page 34: Exploit the Integrated  Graphics in Packet Processing

Focus on Integrated Graphics

• 2nd Generation Intel Core (Sandy Bridge)
  • Features
  • Integrated GPU
  • AVX (Advanced Vector Extensions)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion

Page 35: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Features (I)

• CPU die redesigned
  • The chip's northbridge and GPU are both on-die (in previous versions they were on a physically separate chip)
• LLC (Last Level Cache, formerly L3 cache)
  • Thanks to the new ring bus, the LLC is shared amongst all components, including the GPU
  • Previously, each individual core had its own private path to the LLC
• Unified Memory Architecture (UMA)
  • The graphics subsystem does not have exclusive dedicated memory and uses the host system's memory
  • Dynamic Video Memory Technology (DVMT)
• Hyper-Threading

Page 36: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Features (II)

• Turbo Boost Technology 2.0
  • Adjusts the processor core and GPU frequencies to increase performance while staying within the allotted power/thermal budget
  • The processor can increase individual core speed or graphics speed as the workload dictates
  • Developers cannot directly control it
• AVX (Advanced Vector Extensions)
  • Extends SIMD instructions from 128 bits to 256 bits
  • A single instruction can work on eight single-precision floating-point values at a time, instead of the four that the current 128-bit SIMD provides
  • Increased processor performance with a minimal increase in power (HUGI: Hurry Up and Get Idle)

The next diagram shows the level of integration that Intel has reached…

Page 37: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Block Diagram

Now let's zoom in on the graphics processor…

Page 38: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Integrated GPU (I)

Page 39: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Integrated GPU (II)

• DirectCompute support
  • DirectX 10.1
  • The internal ISA maps one-to-one with most DirectX 10 API instructions, resulting in a very CISC-like architecture
• Execution Unit (EU)
  • The pipeline decoder uses only fixed-function logic to limit the overall power consumption (unlike NVIDIA and AMD, which have programmable stream processors)
  • Each EU can dual-issue, picking instructions from multiple threads
  • Transcendental math is handled by hardware in the EU, and its performance has been sped up considerably

The GPU's parallel capabilities are exploited thanks to DirectCompute, but what about the CPU?

Page 40: Exploit the Integrated  Graphics in Packet Processing

AVX – Overview

KEY FEATURES
• Wider vectors
  • Increased from 128 to 256 bits
  • Two 128-bit load ports
• Enhanced data rearrangement
  • Use the new 256-bit primitives to broadcast, mask loads and stores, and permute data
• Three and four operands
  • Non-destructive source for both AVX 128 and AVX 256
• Flexible unaligned memory access support
• Extensible new opcode (VEX)

BENEFITS
• Higher peak FLOPS with good power efficiency
• Organize, access and pull only necessary data more quickly and efficiently
• Fewer register copies, better register use for both vector and scalar code
• More opportunities to fuse load and compute operations
• Code size reduction

Some assembly instructions can show the power of AVX…

Page 41: Exploit the Integrated  Graphics in Packet Processing

AVX – Instructions (I)

Page 42: Exploit the Integrated  Graphics in Packet Processing

AVX – Instructions (II)

Page 43: Exploit the Integrated  Graphics in Packet Processing

AVX – Code Example (I)

High level code:

#include <immintrin.h>

void foo(float *a, float *b, float *r)
{
    __m256 s1, s2, res;

    s1 = _mm256_loadu_ps(a);
    s2 = _mm256_loadu_ps(b);
    res = _mm256_add_ps(s1, s2);
    _mm256_storeu_ps(r, res);
}

Assembly:

; -- Begin _foo
        ALIGN 16
        PUBLIC _foo
_foo    PROC NEAR
; parameter 1: 4 + esp
; parameter 2: 8 + esp
; parameter 3: 12 + esp
$B2$1:                          ; Preds $B2$0
        mov       eax, DWORD PTR [4+esp]
        mov       edx, DWORD PTR [8+esp]
        mov       ecx, DWORD PTR [12+esp]
        vmovups   ymm0, YMMWORD PTR [eax]
        vaddps    ymm1, ymm0, YMMWORD PTR [edx]
        vmovups   YMMWORD PTR [ecx], ymm1
                                ; LOE ebx ebp esi edi
$B2$2:                          ; Preds $B2$1
        ret                     ;10.1
        ALIGN 16
                                ; LOE
_foo    ENDP
;_foo   ENDS
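The example above adds exactly eight floats, one 256-bit register's worth. A common way to extend it to arrays of arbitrary length is a main loop that consumes eight elements per iteration plus a scalar tail. The sketch below shows that structure in portable C, without intrinsics, so it is an emulation of the pattern rather than AVX code; `add_arrays` and `LANES` are names chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* 256-bit register / 32-bit float = 8 lanes */

/* Adds two float arrays of arbitrary length n. The main loop handles
 * LANES elements per iteration, mirroring one 256-bit load/add/store
 * sequence; the tail loop handles the n % LANES leftover elements. */
static void add_arrays(const float *a, const float *b, float *r, size_t n)
{
    assert(a != NULL && b != NULL && r != NULL);
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        /* Fixed-size inner loop: a compiler may turn this into a
         * single vector instruction, but that is not guaranteed. */
        for (size_t lane = 0; lane < LANES; lane++)
            r[i + lane] = a[i + lane] + b[i + lane];
    }
    for (; i < n; i++)   /* scalar tail */
        r[i] = a[i] + b[i];
}
```

In an intrinsics version, the inner loop would be replaced by the load, add and store calls shown in `foo`, while the tail loop would stay scalar.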

Page 44: Exploit the Integrated  Graphics in Packet Processing

AVX – Benchmarks

Page 45: Exploit the Integrated  Graphics in Packet Processing

AVX – Benchmarks

SIMD processing works best with data-parallel applications where the data is arranged in a structure-of-arrays (SoA) format. Graphics and image processing applications are often highly parallel and well-structured, and thus are typically good candidates for SIMD processing. Geometry or mesh data, on the other hand, is not always uniformly structured in a neat grid.
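The difference between the two layouts can be sketched in C. Applying it to packet header fields is an assumption made here to tie the idea back to packet processing; the slide itself only mentions graphics and image data, and the struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array-of-structures: the fields of one packet are adjacent. */
struct pkt_aos {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/* Structure-of-arrays: the same field of consecutive packets is
 * adjacent, so one vector load can fetch it for several packets. */
struct pkt_soa {
    uint32_t *src_ip;
    uint32_t *dst_ip;
    uint16_t *src_port;
    uint16_t *dst_port;
};

/* Copies n packets from AoS to SoA layout. The caller owns the arrays
 * referenced by `out`, each of which must hold at least n elements. */
static void aos_to_soa(const struct pkt_aos *in, size_t n,
                       struct pkt_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out->src_ip[i]   = in[i].src_ip;
        out->dst_ip[i]   = in[i].dst_ip;
        out->src_port[i] = in[i].src_port;
        out->dst_port[i] = in[i].dst_port;
    }
}
```

With the SoA layout, eight consecutive `src_ip` values occupy 32 contiguous bytes, which matches the width of one 256-bit register.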

Page 46: Exploit the Integrated  Graphics in Packet Processing

Sandy Bridge – Conclusion

• Interesting features for packet processing
  • Integrated memory controller
  • DirectCompute
  • AVX
• CPU+GPU integration is only at the physical layer
  • Packet processing can exploit the CPU or the GPU
• Unpredictable evolution
  • DirectCompute could exploit the CPU
  • AVX could exploit the GPU
• The upcoming Ivy Bridge will support both OpenCL and DirectX 11

Page 47: Exploit the Integrated  Graphics in Packet Processing

Focus on Integrated Graphics

• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
  • Features
  • Block Diagram
  • Customization
• Nvidia Tegra (Tegra 2)
• AMD Fusion

Page 48: Exploit the Integrated  Graphics in Packet Processing

Atom E600 – Features (I)

• SoC (System on Chip)
• Power optimized
  • Fanless performance
• I/O flexible and open
  • Flexible for application-specific needs
  • PCIe instead of a proprietary FSB
• 7-year long-life support
• Hyper-Threading Technology
  • Two logical processors
• SSE3 (Streaming SIMD Extensions 3)
  • Support for SIMD instructions

Page 49: Exploit the Integrated  Graphics in Packet Processing

Atom E600 – Features (II)

• Power saving
  • Intel SpeedStep Technology
    • Enables the operating system to program the processor to transition to lower frequency and/or voltage levels while executing a workload
  • Deep Power Down technology
    • Able to reduce static power consumption by turning off power to the cache and other sub-systems in the processor
  • In-order processing
    • Gives greater power efficiency: the CPU does not reorder the instruction stream to extract instruction-level parallelism
• DirectCompute support
  • None: Tunnel Creek supports only DirectX 9

The next diagram gives an inside view of the Atom architecture…

Page 50: Exploit the Integrated  Graphics in Packet Processing

Atom E600 – Block Diagram

Atom does not support DirectCompute, so we have to concentrate on the great flexibility of the architecture…

Page 51: Exploit the Integrated  Graphics in Packet Processing

Atom E600 – Customization

• Open connection
  • Developers can attach the processor to a variety of chipsets
    • Application-specific third-party chipsets
    • FPGAs
    • ASICs
• The processor can be used without a chipset (limited I/O needs)
  • The processor's four PCIe connections can attach to discrete PCIe peripherals such as Ethernet controllers

Page 52: Exploit the Integrated  Graphics in Packet Processing

Atom E600 – Conclusion

• Interesting features for packet processing
  • Power saving features
  • Long support
  • Flexible architecture
• No GPGPU support
  • Old-school GPGPU
    • Use OpenGL ES 2.0 shaders (programmable shaders)
    • Rewrite the code as a fragment shader
  • Wait for Cedar Trail (2011, not yet released)
    • DirectX 10.1

Page 53: Exploit the Integrated  Graphics in Packet Processing

Focus on Integrated Graphics

• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
  • Features
  • Block Diagram
• AMD Fusion

Page 54: Exploit the Integrated  Graphics in Packet Processing

Tegra – Features

• SoC (System-on-a-chip)
  • Dual-core ARM CPU
  • GeForce GPU
• ULP (ultra-low power consumption)
• Graphics support
  • No DirectX support
  • No CUDA support
  • OpenGL ES 2.0 support

The next diagram gives a quantitative view of a Tegra chip…

Page 55: Exploit the Integrated  Graphics in Packet Processing

Tegra – Block Diagram

Page 56: Exploit the Integrated  Graphics in Packet Processing

Tegra – Conclusion

• Interesting features for packet processing
  • Integrated memory controller
  • Low power consumption
• No GPGPU support
  • Old-school GPGPU
    • Use OpenGL ES 2.0 shaders (programmable shaders)
    • Rewrite the code as a fragment shader
  • Wait for Tegra 3 (third quarter of 2011)
    • DirectX 11
    • CUDA

Page 57: Exploit the Integrated  Graphics in Packet Processing

Focus on Integrated Graphics

• 2nd Generation Intel Core (Sandy Bridge)
• Intel Atom E600 Series (Tunnel Creek)
• Nvidia Tegra (Tegra 2)
• AMD Fusion
  • AMD Vision
  • Features
  • APU Roadmap
  • Integration Highlights

Page 58: Exploit the Integrated  Graphics in Packet Processing

Fusion – AMD Vision

Fusion is a step-forward technology:

AMD has realized this heterogeneous architecture by developing APUs…

Page 59: Exploit the Integrated  Graphics in Packet Processing

Fusion – Features (I)

[Video]

Page 60: Exploit the Integrated  Graphics in Packet Processing

Fusion – Features (II)

• DirectCompute support (DirectX 11)
• OpenCL 1.1
• Additive capabilities of an APU and a discrete graphics solution
• Power-oriented benefits
• Massive SIMD GPU (SSE5)
  • Programmable scalar and vector processor cores
• APU family
  • Bulldozer (Sandy Bridge's opponent)
    • Performance and scalability
  • Bobcat (Atom's opponent)

Let's compare these two solutions…

Page 61: Exploit the Integrated  Graphics in Packet Processing

Fusion – Features (III)

Bulldozer and Bobcat also differ in their market target…

Page 62: Exploit the Integrated  Graphics in Packet Processing

Fusion – APU roadmap

The high level of integration differentiates APUs from CPUs…

Page 63: Exploit the Integrated  Graphics in Packet Processing

Fusion – Integration Highlights

• Shared memory
  • Lower latencies
• PCI Express
  • Cuts down some latencies
• No discrete GPU, hence less
  • Cost
  • Power
  • Motherboard complexity

Page 64: Exploit the Integrated  Graphics in Packet Processing

Fusion – Conclusion

• Interesting features for packet processing
  • OpenCL/DirectCompute/SSE5
  • Tightly integrated architecture
  • New technology (First-Come-First-Served)
• OpenCL
  • Could be the "El Dorado" for packet processing
    • CPU/GPU working in AND/OR configuration
    • Shared memory
    • Embedded implementation of Fusion technology
  • AMD has declared its support for it, to bring the power of heterogeneous computing to the mainstream

Page 65: Exploit the Integrated  Graphics in Packet Processing

CONCLUSIONS

Page 66: Exploit the Integrated  Graphics in Packet Processing

Summary (I)

This presentation has shown several ways of exploiting integrated graphics and, more generally, consumer architectures for packet processing:

• GPGPU-driven solutions
  • CUDA, OpenCL, DirectX 11
• SIMD-driven solutions
  • Exploit highly parallel operations through SIMD implementations
  • AVX, SSE
• Custom hardware solutions
  • Design flexible modules tailored to specific needs
  • FPGA

The first of these is the most in vogue at the moment…

Page 67: Exploit the Integrated  Graphics in Packet Processing

Summary (II)

Architecture    OpenCL   SSE          FPGA   DirectCompute   OpenGL
Sandy Bridge    X        V (AVX)      X      V               V
Atom E600       X        V (SSE 3)    V      X               V
Tegra 2         X        V (SSE 3)    X      X               V
AMD Fusion      V        V (SSE 5)    X      V               V

(V = supported, X = not supported)

Page 68: Exploit the Integrated  Graphics in Packet Processing

Recommendations

Writing explicitly parallel code is more efficient than relying on hardware parallelization:

Page 69: Exploit the Integrated  Graphics in Packet Processing

THANK YOU

Questions?
