MUDA: MUltiple Data Accelerator language
Project Overview, Feb 24, 2008
Syoyo FUJITA

Transcript
Page 1: Muda Proposal

MUDA: MUltiple Data Accelerator language

Project Overview, Feb 24, 2008

Syoyo FUJITA

Page 2: Muda Proposal

?

Page 3: Muda Proposal

Nikkei 225 index

Page 4: Muda Proposal

?

Page 5: Muda Proposal

Feb 2007 -> Feb 2008

CPU: Mac Pro octo, 204 GFlops (+800%)

GPU: GeForce 9800 GX2 rumor: 1 TFlops? (3x of G80), 500 GFlops? (+50% of G80). No update!

PS3: 179.2 GFlops

GPU slumps, CPU soars

Page 6: Muda Proposal

Nikkei 225 index

Page 7: Muda Proposal

Nikkei 225 index

Future of GPU trend

Subprime shock! Credit boom ends! US economy declines! Green IT!

Page 8: Muda Proposal

CPU GPU

GPGPU

Accelerated computing

many-core

Page 9: Muda Proposal

CPU GPU

GPGPU

Accelerated computing

many-core

NO!

GPGPU was dead!! GPU will be dead soon!!

Page 10: Muda Proposal

Why GPU -> GPGPU is BAD

• Larger latency: host <-> GPU over PCI Express

• Internal architecture is black box

• Only the GPU maker knows it

• Larger cost of branching

• Debugger?

• Programs only run on a specific GPU maker’s GPUs

• Not portable.

Page 11: Muda Proposal

Why CPU -> Accelerated computing is GOOD

• Easy to program

• CPU maker provides good internal spec documentation

• Fast execution of branching

• gdb :-)

• Portable & Versatile

Page 12: Muda Proposal

CPU

Accelerated computing

many-core

MUDA

Page 13: Muda Proposal

MUDA’s goal

• Extract the CPU’s maximum floating-point performance for large data

• SIMD

• Cache optimized computation

Page 14: Muda Proposal

MUDA example

vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0      = rsqrt(x);
    y0x     = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}

MUDA code

Page 15: Muda Proposal

__m128 sqrtmu(const __m128 *x)
{
    __m128 y0;
    __m128 y0x;
    __m128 y0xhalf;
    const __m128 t_vec4 = (__m128)_mm_set1_epi32(1065353217);
    __m128 oneish = t_vec4;
    const __m128 t_vec6 = (*x);
    const __m128 t_vec5 = _mm_rsqrt_ps(t_vec6);
    y0 = t_vec5;
    const __m128 t_vec8 = y0;
    const __m128 t_vec9 = (*x);
    const __m128 t_vec7 = _mm_mul_ps(t_vec8, t_vec9);
    y0x = t_vec7;
    const float t_float13 = 0.5;
    const float t_float12 = t_float13;
    const __m128 t_vec10 = _mm_set_ps1(t_float12);
    const __m128 t_vec14 = y0x;
    const __m128 t_vec11 = _mm_mul_ps(t_vec10, t_vec14);
    y0xhalf = t_vec11;
    const __m128 t_vec19 = oneish;
    const __m128 t_vec20 = y0;
    const __m128 t_vec21 = y0x;
    const __m128 t_vec15 = _mm_mul_ps(t_vec20, t_vec21);
    const __m128 t_vec16 = _mm_sub_ps(t_vec19, t_vec15);
    const __m128 t_vec22 = y0xhalf;
    const __m128 t_vec17 = _mm_mul_ps(t_vec16, t_vec22);
    const __m128 t_vec23 = y0x;
    const __m128 t_vec18 = _mm_add_ps(t_vec17, t_vec23);
    return t_vec18;
}

x86/SSE output

Page 16: Muda Proposal

Why MUDA?

Page 17: Muda Proposal

No unified way to describe SIMD op

• SSE: _mm_add_ps()

• AltiVec: vec_add

• SPE: spu_add

Page 18: Muda Proposal

CPU ISA changes frequently

• SSE2(2000), SSE3(2004), SSE4(2006)

• SSE5 and coming new CPU designs(?)

• 8-element SIMD? No SIMD in future CPUs?

• Keeping up with them is hard and not productive. A waste of your time.

Page 19: Muda Proposal

MUDA source -> MUDA compiler -> { SSE2 C code, SSE4 C code, VMX C code, LLVM IR }

Portable, CPU-independent description -> CPU- or arch-dependent code

Page 20: Muda Proposal

Status

• SSE2 backend : 75 %

• SSE4 backend : 0 %

• VMX backend : 20 %

• LLVM IR backend : 30 %

• SIMD math function for MUDA : 5 %

• Automatic optimizer : TODO

(= items I’m currently working on)

Page 21: Muda Proposal

Future direction

• Cache miss analysis and memory access optimization

• Valgrind, Cache Miss Equation(CME)

• Automatic optimization

• As FFTW, ATLAS, and Spiral do

• Automatic error measurement for floating point computation

• Interval Arithmetic, Affine Arithmetic, Gappa

Page 22: Muda Proposal

Performance gap (chart)

Scalar : SIMD = 1 : 4

cache miss : cache hit = 1 : 100

Page 23: Muda Proposal

Performance gap (chart)

Scalar : SIMD = 1 : 4

cache miss : cache hit = 1 : 100

Optimizing memory access is much more important than SIMDization