
Analysis of the FMM Algorithm for Innovative Computer Architecture Design

Feb 10, 2016


yuma

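Analysis of the FMM Algorithm for Innovative Computer Architecture Design. Lü Chao (吕超), School of Software, Shanghai Jiao Tong University, 2010. Outline: project background; prior work; introduction to the N-body problem; analysis of the FMM algorithm; configuration strategies optimized for FMM; conclusion. Project background. Project source: Research and Development of New-Concept High-Efficiency Computer Architectures and Systems, a key project of the National 863 Program (2009AA012201) and a major science and technology project of the Shanghai Science and Technology Commission (08dz501600). Topic: application analysis and front-end design for a new architecture, covering preliminary application analysis, architecture front-end design, compiler / software platform design, and application optimization.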
Transcript
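Page 1: Analysis of the FMM Algorithm for Innovative Computer Architecture Design

Analysis of the FMM Algorithm for Innovative Computer Architecture Design
Lü Chao (吕超), School of Software, Shanghai Jiao Tong University

2010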

Page 2

OUTLINE
• Project background
• Prior work
• Introduction to the N-body problem
• Analysis of the FMM algorithm
• Configuration strategies optimized for FMM
• Conclusion

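Page 3

PROJECT BACKGROUND
• Project source: Research and Development of New-Concept High-Efficiency Computer Architectures and Systems
  • Key project of the National 863 Program (2009AA012201)
  • Major science and technology project of the Shanghai Science and Technology Commission (08dz501600)
• Topic: application analysis and front-end design for a new architecture
  • Preliminary application analysis
  • Architecture front-end design
  • Compiler / software platform design
  • Application optimization
• Main goal: design a reconfigurable special-purpose processor architecture for high-performance computing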

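Page 4

PRIOR WORK
• Application analysis for high-performance computing
  • Image reconstruction for CT and MRI
  • Local image feature extraction and matching based on the SURF algorithm
• Application simulation and optimization
  • Optimization and analysis of SURF parallelized on multi-core CPUs
  • SURF implementation on GPGPU (CUDA-SURF)
  • SURF optimization on a heterogeneous CPU and GPU platform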

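Page 5

INTRODUCTION TO THE N-BODY PROBLEM
• Why it is introduced
  • Analyzed as a representative application for the architecture
  • Used to derive architecture design strategies optimized for the application
• The N-body problem
  • Also called the many-body problem; one of the fundamental problems of astrophysics, fluid mechanics, and molecular dynamics
  • Simulates the motion of interacting particles in a system
  • A typical high-performance computing application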

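• Mathematical meaning: a system of ordinary differential equations with known initial values

$$ m_i \ddot{q}_i = \sum_{j=1,\; j \neq i}^{n} \frac{G\, m_i m_j \,(q_j - q_i)}{\lVert q_j - q_i \rVert^{3}}, \qquad i = 1, 2, \ldots, n $$

where $m_i$ and $q_i$ are the mass and position of particle $i$, and $G$ is the gravitational constant.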

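Page 6

INTRODUCTION TO THE N-BODY PROBLEM (CONT.)
• Common algorithms
  • PP (Particle to Particle) algorithm
    • Evaluates the formula directly for every pair of particles
    • Time complexity O(N²)
  • PM (Particle Mesh Method) algorithm
    • Uses a particle mesh to treat the effect of many points as a whole (computes the potential on the mesh)
    • Time complexity O(N log N)
  • TM (Tree Method) algorithm
    • Groups particles in a hierarchical tree and approximates the effect of distant groups as a whole
    • Time complexity O(N log N)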
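
The PP algorithm listed above can be sketched as a double loop over all particle pairs, which is where the O(N²) cost comes from. This is a minimal illustration rather than code from the slides: the function name `pp_accelerations`, the default `G=1.0`, and the optional `softening` term are assumptions made for the example.

```python
import math


def pp_accelerations(masses, positions, G=1.0, softening=0.0):
    """Direct particle-to-particle sum of gravitational accelerations.

    masses:    list of N floats
    positions: list of N (x, y, z) tuples
    G and softening are illustrative parameters, not taken from the slides.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):              # N outer iterations ...
        for j in range(n):          # ... times N inner iterations: O(N^2)
            if i == j:
                continue
            # Vector from particle i to particle j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening * softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc
```

Each call evaluates N(N-1) pair interactions; the mesh and tree methods on this slide reduce that count by treating groups of particles as a whole.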

Page 7

OVERVIEW OF GPU ARCHITECTURE
Graphics Pipeline / Programmable Hardware / Unified Shading Model / NVIDIA GeForce 8800 GTX

Page 8

GRAPHICS PIPELINE
• The Vertex/Geometry Stage
  • transforms each vertex from object space into screen space
  • assembles the vertices into triangles
  • traditionally performs lighting calculations on each vertex
• The Rasterization Stage
  • determines the screen positions covered by each triangle
  • interpolates per-vertex parameters across the triangle
• The Fragment/Pixel Stage
  • computes the color for each fragment
• The Composition/Display Stage
  • assembles fragments into an image of pixels
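
The rasterization stage above does two things: it decides which pixels a triangle covers, and it interpolates per-vertex parameters at those pixels. A small software sketch of both, using barycentric weights computed from edge functions, is given here. Real GPUs do this in fixed-function hardware; the names `edge` and `rasterize` and the choice to sample at pixel centers are assumptions of this example.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return {(x, y): interpolated value} for pixels covered by the triangle.

    v0..v2 are 2D screen-space vertices, c0..c2 one scalar parameter per vertex.
    """
    area = edge(v0, v1, v2)
    fragments = {}
    if area == 0:                      # degenerate triangle covers nothing
        return fragments
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)     # sample at the pixel center
            # Barycentric weights; dividing by area handles either winding.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                fragments[(x, y)] = w0 * c0 + w1 * c1 + w2 * c2
    return fragments
```

Each entry of the returned dictionary corresponds to one fragment that the fragment/pixel stage would then shade.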

Page 9

PROGRAMMABLE HARDWARE
• In the Programmable Graphics Pipeline
  • User-defined vertex program
  • User-defined fragment program
• Limitations
  • Simple, incomplete instruction sets
  • Fragment program data types are mostly fixed-point
  • Limited number of instructions and a small number of registers
  • Limited number of inputs and outputs
  • No conditional branching

Page 10

UNIFIED SHADER MODEL
• A unified shader model must
  • Have at least 65 k static instructions and unlimited dynamic instructions
  • Support both 32-bit integers and 32-bit floating-point numbers
  • Allow an arbitrary number of both direct and indirect reads from global memory (texture)
  • Support dynamic flow control in the form of loops and branches
• Current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders

Page 11

UNIFIED SHADING ARCHITECTURE
• Green grid – Streaming Multiprocessor
• Grid of purple board – Thread Processor
• 16 streaming multiprocessors of 8 thread processors each (128 thread processors in total)

NVIDIA GeForce 8800 GTX Architecture

Page 12

UNIFIED SHADING ARCHITECTURE (CONT.)
• One texture/processor cluster (TPC) contains a pair of streaming multiprocessors
• One streaming multiprocessor contains shared instruction and data caches, control logic, a 16 KB shared memory, eight stream processors, and two special function units

NVIDIA GeForce 8800 GTX – Texture/Processor Cluster

Page 13

HOW TO PROGRAM GPGPU
GPU Programming Model / GPU Programming Flow Control / GPGPU Techniques / GPGPU Applications

Page 14

GPU PROGRAMMING MODEL
• The GPU programming model contains
  • graphics API terminology
  • stream programming model
• A typical GPGPU program using the fragment processor is structured as follows
  • Segment the general-purpose program into independent parallel sections (kernels)
  • Specify the range of computation / the size of the output stream to invoke a kernel
  • Use the rasterizer to generate a fragment for every pixel location in the quad
  • Each of the generated fragments is then processed by the active kernel fragment program
  • The output of the fragment program is a value (or vector of values) per fragment
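
The program structure described above can be mimicked on the CPU to make the roles clearer. In this sketch a plain Python function plays the kernel, nested lists play the input and output streams (textures), and two loops stand in for the rasterizer generating one fragment per pixel of the quad. The names `run_kernel` and `saxpy_kernel` are illustrative assumptions, not part of any graphics API.

```python
def run_kernel(kernel, width, height, *inputs):
    """Invoke `kernel` once for every fragment of a width x height quad.

    The loops stand in for the rasterizer; on a GPU the fragments would be
    processed in parallel, which is valid because each one is independent.
    """
    output = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # One value per fragment, written only to its own output location.
            output[y][x] = kernel(x, y, *inputs)
    return output


def saxpy_kernel(x, y, a, xs, ys):
    """Example kernel: reads (gathers) from two input streams, returns one value."""
    return a * xs[y][x] + ys[y][x]


xs = [[1.0, 2.0], [3.0, 4.0]]
ys = [[10.0, 20.0], [30.0, 40.0]]
result = run_kernel(saxpy_kernel, 2, 2, 2.0, xs, ys)
```

The kernel can read any input location but writes only its own fragment's output, which mirrors the gather-friendly, scatter-averse nature of the fragment processor.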

Page 15

GPU PROGRAMMING FLOW CONTROL
• Three basic implementations of data-parallel branching
  • Predication
    • Both sides of the branch are evaluated
  • Multiple Instruction Multiple Data (MIMD) branching
    • Different processors follow different paths
  • Single Instruction Multiple Data (SIMD) branching
    • If the branch condition is identical for all pixels in the group, only the taken side of the branch must be evaluated
    • If one or more of the processors evaluates the branch condition differently, then both sides must be evaluated and the results predicated

Better to Move Branching Up The Pipeline
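
The cost difference between coherent and divergent SIMD branches can be illustrated with a small emulation. This is a sketch under simplifying assumptions: a Python list stands in for one group of pixels processed in lockstep, and the hypothetical function `simd_branch` reports how many sides of the branch had to be evaluated.

```python
def simd_branch(group, cond, then_fn, else_fn):
    """Emulate SIMD branching for one group of elements.

    Returns (results, sides_evaluated).
    """
    mask = [cond(v) for v in group]
    if all(mask):
        # Coherent: every element takes the "then" side.
        return [then_fn(v) for v in group], 1
    if not any(mask):
        # Coherent: every element takes the "else" side.
        return [else_fn(v) for v in group], 1
    # Divergent: evaluate both sides for all elements, then select per
    # element by the mask (predication).
    then_vals = [then_fn(v) for v in group]
    else_vals = [else_fn(v) for v in group]
    results = [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
    return results, 2
```

A group whose elements all agree pays for one side only, while a single disagreeing element doubles the work for the whole group. That is the motivation for resolving branches earlier in the pipeline, where whole groups can be routed to one side.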

Page 16

GPGPU TECHNIQUES
• Stream Operations:
  • Map and Reduce – Straightforward [BFH∗04b] BUCK I., FOLEY T., HORN D., SUGERMAN J., FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23, 3 (Aug. 2004), 777–786.
  • Scatter and Gather – Avoid Scatter [Buc05b] BUCK I.: Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 32, pp. 509–519.
  • Scan – All-prefix-sums operation [HS86] [Ble90] [Hor05] [HSC∗05] [SLO06, GGK06]
  • Filtering – Using a combination of scan and search, O(log(n)) is achieved. [Hor05] HORN D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573–589.
  • Sort – Based on sorting networks, such as parallel bitonic merge sort [BP04, CND03, GZ06, KSW04, KW05a, PDC∗03, Pur04]
  • Search – Binary search [Hor05, PDC∗03, Pur04] / Nearest neighbor search [Ben75, FS05, PDC∗03, Pur04]
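
To make the scan and filtering entries above concrete, here is a CPU sketch of an all-prefix-sums operation and of a filter built from scan plus search. It follows the general idea named on the slide and is not claimed to match the formulation in [Hor05]. The step-doubling scan is the well-known Hillis-Steele scheme; the function names are assumptions of this example.

```python
from bisect import bisect_left


def inclusive_scan(values):
    """All-prefix-sums in ceil(log2 n) data-parallel steps (Hillis-Steele).

    Each step reads only the previous step's array, so on a GPU every
    element of a step could be computed by an independent fragment.
    """
    result = list(values)
    offset = 1
    while offset < len(result):
        prev = result
        result = [prev[i] + prev[i - offset] if i >= offset else prev[i]
                  for i in range(len(prev))]
        offset *= 2
    return result


def stream_filter(values, keep):
    """Compact the elements satisfying `keep`, using gather instead of scatter."""
    flags = [1 if keep(v) else 0 for v in values]
    counts = inclusive_scan(flags)
    total = counts[-1] if counts else 0
    # Output slot k locates its source with a binary search over the scan:
    # the first index whose running count reaches k + 1 is the (k+1)-th
    # kept element. Each slot does O(log n) work and only reads.
    return [values[bisect_left(counts, k + 1)] for k in range(total)]
```

Because each output element searches for its own input, the filter needs only gather, which fits the advice on this slide to avoid scatter.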

Page 17

GPGPU TECHNIQUES (CONT.)
• Data Structures
  • Iteration
    • Dense structures are supported straightforwardly
    • Sparse arrays, adaptive arrays, and grid-of-list structures require more complex iteration constructs [BFGS03, KW03, LKHW04]
  • Generalized Arrays via Address Translation
    • Address translator converts between 1D array and 2D texture [LKO05, PBMH02]
    • Optimization techniques for pre-computing these address translation operations before the fragment processor [BFGS03, CHL04, KW03, LKHW04]
• Differential Equations, Linear Algebra, Data Queries …
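
The address translation mentioned above, between a 1D array and a 2D texture, comes down to a division and a remainder. The sketch below assumes row-major packing into a texture of fixed width; the function names and the padding value are illustrative choices, not taken from the cited work.

```python
def to_2d(index, tex_width):
    """Translate a 1D array index into (x, y) texture coordinates."""
    return index % tex_width, index // tex_width


def to_1d(x, y, tex_width):
    """Translate (x, y) texture coordinates back into a 1D array index."""
    return y * tex_width + x


def pack(array, tex_width, pad=0.0):
    """Store a 1D array row by row in a 2D texture, padding the last row."""
    height = -(-len(array) // tex_width)      # ceiling division
    texture = [[pad] * tex_width for _ in range(height)]
    for i, value in enumerate(array):
        x, y = to_2d(i, tex_width)
        texture[y][x] = value
    return texture
```

Doing this arithmetic for every access inside a fragment program is costly, which is why the slide points to techniques that pre-compute the translation before the fragment processor.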

Page 18

GPGPU APPLICATIONS
• Physically Based Simulation
• Signal and Image Processing
  • Computer Vision
  • Image Processing
  • Signal Processing
  • Tone Mapping
  • Audio
  • Image / Video Processing
• Global Illumination
  • Ray tracing, photon mapping, radiosity, subsurface scattering …
• Geometric Computing
• Databases and Data Mining

Page 19

CONCLUSION
• Highly parallel nature
  • But currently only data-parallel for general-purpose computation
• Many applications can be mapped onto the GPU
  • But double precision, scatter, and efficient branching are not supported
• Program with graphics API
  • But hard to understand and use
• What we are looking forward to
  • More programmable and flexible hardware needed
  • High-level programming model needed

Page 20

ANY QUESTIONS?

Page 21

THANK YOU
The End