FMM Algorithm Analysis for Innovative Computer Architecture Design

Lü Chao (吕超), School of Software, Shanghai Jiao Tong University

2010

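OUTLINE
• Project background
• Preliminary work
• Introduction to the N-body problem
• FMM algorithm analysis
• Configuration strategies optimized for FMM
• Conclusion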

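PROJECT BACKGROUND
• Project source: Research and Development of New-Concept High-Efficiency Computer Architectures and Systems
  • National 863 Program key project (2009AA012201)
  • Shanghai Science and Technology Commission major science and technology project (08dz501600)
• Project scope: application analysis and front-end design for a new architecture
  • Preliminary application analysis
  • Front-end design of the architecture
  • Compiler / software platform design
  • Application optimization
• Main goal: design a reconfigurable special-purpose processor architecture for high-performance computing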

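PRELIMINARY WORK
• Analysis of high-performance computing applications
  • CT and MRI image reconstruction
  • Local image feature extraction and matching based on the SURF algorithm
• Application simulation and optimization
  • Optimization and analysis of the SURF algorithm parallelized on multi-core CPUs
  • GPGPU implementation of the SURF algorithm (CUDA-SURF)
  • SURF algorithm optimization on a heterogeneous CPU and GPU platform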

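INTRODUCTION TO THE N-BODY PROBLEM
• Motivation
  • Analyze it as a representative application for the architecture
  • Derive architecture design strategies optimized for the application
• The N-body problem
  • Also known as the many-body problem; one of the fundamental problems of astrophysics, fluid mechanics, and molecular dynamics
  • Simulates the motion of interacting particles in a system
  • A typical high-performance computing application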

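• Mathematical formulation: a system of ordinary differential equations with known initial values

$$ m_i \ddot{q}_i = \sum_{j=1,\, j \neq i}^{n} \frac{m_i m_j \,(q_j - q_i)}{\lvert q_j - q_i \rvert^{3}}, \qquad i = 1, 2, \ldots, n $$

where $m_i$ and $q_i$ are the mass and position of body $i$ (gravitational constant taken as 1).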
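To make the formulation above concrete, its right-hand side can be evaluated by direct summation. The following is a minimal sketch, not part of the original slides: the function name `accelerations`, the 2D tuple representation, and the unit gravitational constant are all assumptions made for illustration.

```python
import math

def accelerations(masses, positions):
    """Return the acceleration of every body by direct pairwise summation.

    Evaluates a_i = sum_{j != i} m_j * (q_j - q_i) / |q_j - q_i|^3, which is
    the N-body equation divided by m_i. The gravitational constant is taken
    as 1 (an assumption of this sketch). Positions are (x, y) tuples.
    """
    n = len(masses)
    acc = []
    for i in range(n):
        ax, ay = 0.0, 0.0
        for j in range(n):
            if j == i:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            r3 = math.hypot(dx, dy) ** 3
            ax += masses[j] * dx / r3
            ay += masses[j] * dy / r3
        acc.append((ax, ay))
    return acc
```

A time integrator (for example leapfrog) would call such a function once per step; the nested loop is what makes the direct approach quadratic in the number of bodies.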

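INTRODUCTION TO THE N-BODY PROBLEM (CONT.)
• Common algorithms
  • PP (Particle to Particle) algorithm
    • Computes every pairwise interaction directly from the formula
    • Time complexity O(N^2)
  • PM (Particle Mesh Method) algorithm
    • Uses a particle mesh to treat the effect of many particles as a whole (the potential is computed on the mesh)
    • Time complexity O(N log N)
  • TM (Tree Method) algorithm
    • Groups particles in a hierarchical tree and approximates the effect of each distant group as a whole
    • Time complexity O(N log N)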
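The gap between the complexities listed above is easiest to appreciate with numbers. The sketch below compares growth rates only; the helper names are hypothetical, and the O(N log N) estimate ignores constant factors, which in practice are considerably larger for mesh and tree methods than for direct summation.

```python
import math

def pp_interactions(n):
    """Number of ordered pairwise force evaluations in the PP algorithm."""
    return n * (n - 1)

def nlogn_estimate(n):
    """Growth-rate estimate for an O(N log N) method (constant factor ignored)."""
    return n * math.log2(n)

# Print both counts side by side for a small and a large particle count.
for n in (1_000, 1_000_000):
    print(n, pp_interactions(n), round(nlogn_estimate(n)))
```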

OVERVIEW OF GPU ARCHITECTURE
Graphics Pipeline / Programmable Hardware / Unified Shading Model / NVIDIA GeForce 8800 GTX

GRAPHICS PIPELINE
• The Vertex/Geometry Stage
  • Transforms each vertex from object space into screen space
  • Assembles the vertices into triangles
  • Traditionally performs lighting calculations on each vertex
• The Rasterization Stage
  • Determines the screen positions covered by each triangle
  • Interpolates per-vertex parameters across the triangle
• The Fragment/Pixel Stage
  • Computes the color for each fragment
• The Composition/Display Stage
  • Assembles fragments into an image of pixels
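The rasterization stage described above can be sketched in a few lines. This is an illustrative software model, not how any particular GPU implements it: the names `edge` and `rasterize`, the pixel-centre sampling rule, and the use of barycentric weights for interpolation are assumptions of the sketch.

```python
def edge(a, b, p):
    """Twice the signed area of triangle a-b-p; the sign tells which side of a->b holds p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Yield (x, y, value) for every pixel centre covered by the triangle.

    value is the per-vertex parameter (c0, c1, c2) interpolated across the
    triangle with barycentric weights.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        return  # degenerate triangle covers no pixels
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            # Dividing by the signed area makes the test independent of winding order.
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                yield x, y, w0 * c0 + w1 * c1 + w2 * c2
```

Each yielded tuple corresponds to one fragment handed to the fragment/pixel stage.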

PROGRAMMABLE HARDWARE
• In the programmable graphics pipeline
  • User-defined vertex program
  • User-defined fragment program
• Limitations
  • Simple, incomplete instruction sets
  • Fragment program data types are mostly fixed-point
  • Limited number of instructions and a small number of registers
  • Limited number of inputs and outputs
  • No conditional branching

UNIFIED SHADER MODEL
• A unified shader model must
  • Have at least 65 k static instructions and unlimited dynamic instructions
  • Support both 32-bit integers and 32-bit floating-point numbers
  • Allow an arbitrary number of both direct and indirect reads from global memory (texture)
  • Support dynamic flow control in the form of loops and branches
• Current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders

UNIFIED SHADING ARCHITECTURE
• Green grid – Streaming Multiprocessor
• Grid of purple board – Thread Processor
• 16 streaming multiprocessors of 8 thread processors each (128 thread processors in total)

NVIDIA GeForce 8800 GTX Architecture

UNIFIED SHADING ARCHITECTURE (CONT.)
• One texture/processor cluster (TPC) contains a pair of streaming multiprocessors
• One streaming multiprocessor contains shared instruction and data caches, control logic, a 16 KB shared memory, eight stream processors, and two special function units

NVIDIA GeForce 8800 GTX – Texture/Processor Cluster (TPC)

HOW TO PROGRAM GPGPU
GPU Programming Model / GPU Programming Flow Control / GPGPU Techniques / GPGPU Applications

GPU PROGRAMMING MODEL
• The GPU programming model contains
  • Graphics API terminology
  • The stream programming model
• A typical GPGPU program using the fragment processor is structured as follows
  • Segment the general-purpose program into independent parallel sections (kernels)
  • Specify the range of computation / the size of the output stream to invoke a kernel
  • Use the rasterizer to generate a fragment for every pixel location in the quad
  • Each generated fragment is then processed by the active kernel fragment program
  • The output of the fragment program is a value (or vector of values) per fragment
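The structure listed above can be mimicked on the CPU to show the shape of a stream program. This is a sequential emulation under stated assumptions: `run_kernel` and `saxpy_kernel` are hypothetical names, nested lists stand in for textures, and a real GPU would evaluate the fragments in parallel rather than in a loop.

```python
def run_kernel(kernel, width, height, *inputs):
    """Invoke `kernel` once per output location of a width x height quad.

    This stands in for the rasterizer generating one fragment per pixel.
    Each invocation sees only its own (x, y) plus read-only inputs, and
    returns exactly one value, which becomes the output at (x, y).
    """
    return [[kernel(x, y, *inputs) for x in range(width)]
            for y in range(height)]

def saxpy_kernel(x, y, a, xs, ys):
    """Example kernel: a * xs + ys, evaluated independently for each element."""
    return a * xs[y][x] + ys[y][x]

xs = [[1.0, 2.0], [3.0, 4.0]]
ys = [[10.0, 20.0], [30.0, 40.0]]
out = run_kernel(saxpy_kernel, 2, 2, 2.0, xs, ys)
```

Because a kernel can only return a value for its own location, the model supports gather (arbitrary reads) but not scatter (arbitrary writes).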

GPU PROGRAMMING FLOW CONTROL
• Three basic implementations of data-parallel branching
  • Predication
    • Both sides of the branch are evaluated
  • Multiple Instruction Multiple Data (MIMD) branching
    • Different processors follow different paths
  • Single Instruction Multiple Data (SIMD) branching
    • If the branch condition is identical for all pixels in the group, only the taken side of the branch must be evaluated
    • If one or more of the processors evaluates the branch condition differently, then both sides must be evaluated and the results predicated

Better to move branching up the pipeline
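The cost difference between a uniform and a divergent SIMD branch can be illustrated with a small emulation. The function `simd_branch` and its return convention are assumptions made for this sketch; it models one group of lanes and counts how many sides of the branch had to be executed.

```python
def simd_branch(conditions, then_fn, else_fn, values):
    """Emulate one SIMD group executing `then_fn(v) if cond else else_fn(v)`.

    Returns (results, sides_evaluated). When every lane agrees on the
    condition, only the taken side runs. When lanes diverge, both sides
    run for all lanes and the condition selects (predicates) each result.
    """
    if all(conditions):
        return [then_fn(v) for v in values], 1
    if not any(conditions):
        return [else_fn(v) for v in values], 1
    then_results = [then_fn(v) for v in values]
    else_results = [else_fn(v) for v in values]
    merged = [t if c else e
              for c, t, e in zip(conditions, then_results, else_results)]
    return merged, 2
```

Moving a branch up the pipeline, for instance by issuing separate draws for the two cases, keeps each group uniform so that only one side is ever evaluated.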

GPGPU TECHNIQUES
• Stream Operations:
  • Map and Reduce – Straightforward [BFH∗04b] BUCK I., FOLEY T., HORN D., SUGERMAN J., FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23, 3 (Aug. 2004), 777–786.
  • Scatter and Gather – Avoid Scatter [Buc05b] BUCK I.: Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 32, pp. 509–519.
  • Scan – All-prefix-sums operation [HS86] [Ble90] [Hor05] [HSC∗05] [SLO06, GGK06]
  • Filtering – Using a combination of scan and search, O(log(n)) achieved. [Hor05] HORN D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573–589.
  • Sort – Based on sorting networks, such as parallel bitonic merge sort [BP04, CND03, GZ06, KSW04, KW05a, PDC∗03, Pur04]
  • Search – Binary search [Hor05, PDC∗03, Pur04] / Nearest neighbor search [Ben75, FS05, PDC∗03, Pur04]
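The scan and filtering operations listed above combine naturally: a scan over keep-flags tells each output slot where to look, and a binary search lets it gather its element without any scatter. The sketch below illustrates that idea under stated assumptions; `inclusive_scan` and `stream_filter` are hypothetical names, and the code is not claimed to reproduce the exact procedure of [Hor05].

```python
from bisect import bisect_left

def inclusive_scan(values):
    """All-prefix-sums in ceil(log2(n)) data-parallel steps (Hillis-Steele style).

    Every element of a step reads only from the previous step's array,
    so each step could run as one parallel pass.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        previous = result
        result = [previous[i] + previous[i - offset] if i >= offset else previous[i]
                  for i in range(len(previous))]
        offset *= 2
    return result

def stream_filter(values, keep):
    """Compact `values` to the elements where keep(v) is true, using gathers only."""
    flags = [1 if keep(v) else 0 for v in values]   # map
    counts = inclusive_scan(flags)                  # scan
    total = counts[-1] if counts else 0
    # Output slot k binary-searches for the first input whose running count is k + 1.
    return [values[bisect_left(counts, k + 1)] for k in range(total)]
```

For example, `stream_filter([5, 2, 8, 1, 9], lambda v: v > 4)` keeps the elements greater than 4 in their original order.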

GPGPU TECHNIQUES (CONT.)
• Data Structures
  • Iteration
    • Dense structures are supported straightforwardly
    • Sparse arrays, adaptive arrays, and grid-of-list structures require more complex iteration constructs [BFGS03, KW03, LKHW04]
  • Generalized Arrays via Address Translation
    • An address translator converts between 1D array and 2D texture addresses [LKO05, PBMH02]
    • Optimization techniques pre-compute these address translation operations before the fragment processor [BFGS03, CHL04, KW03, LKHW04]
• Differential Equations, Linear Algebra, Data Queries …
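The address translation between 1D arrays and 2D textures mentioned above amounts to a division and a remainder. The following sketch assumes a row-major layout and zero padding of the last row; the function names `to_2d`, `to_1d`, and `pack` are hypothetical.

```python
def to_2d(index, width):
    """Translate a 1D array index into (x, y) texel coordinates (row-major)."""
    return index % width, index // width

def to_1d(x, y, width):
    """Translate (x, y) texel coordinates back into a 1D array index."""
    return y * width + x

def pack(array, width):
    """Store a 1D array row by row in a 2D 'texture', zero-padding the last row."""
    height = -(-len(array) // width)  # ceiling division
    texture = [[0] * width for _ in range(height)]
    for i, value in enumerate(array):
        x, y = to_2d(i, width)
        texture[y][x] = value
    return texture
```

Pre-computing the translated coordinates outside the kernel, as the cited optimizations do, removes this arithmetic from the per-fragment work.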

GPGPU APPLICATIONS
• Physically Based Simulation
• Signal and Image Processing
  • Computer Vision
  • Image Processing
  • Signal Processing
  • Tone Mapping
  • Audio
  • Image / Video Processing
• Global Illumination
  • Ray tracing, photon mapping, radiosity, subsurface scattering …
• Geometric Computing
• Databases and Data Mining

CONCLUSION
• Highly parallel nature
  • But currently only data-parallel for general-purpose computation
• Many applications can be mapped onto the GPU
  • But double precision, scatter, and efficient branching are not supported
• Programmed through graphics APIs
  • But hard to understand and use
• What we are looking forward to
  • More programmable and flexible hardware
  • A high-level programming model

ANY QUESTIONS?

THANK YOU The End
