Robert Liao Tracy Wang CS252 Spring 2007

Robert Liao Tracy Wang CS252 Spring 2007. Overview Traditional GPU Architecture The NVIDIA G80 Processor CUDA (Compute Unified Device Architecture) LAPACK.

Jan 03, 2016

Page 1: Robert Liao Tracy Wang CS252 Spring 2007. Overview Traditional GPU Architecture The NVIDIA G80 Processor CUDA (Compute Unified Device Architecture) LAPACK.

Robert Liao, Tracy Wang

CS252 Spring 2007

Page 2: Overview

Traditional GPU Architecture
The NVIDIA G80 Processor
CUDA (Compute Unified Device Architecture)
LAPACK
Performance and Issues

Page 3: A Quick Note on Naming

“G80” is the codename for the GPU found in the following graphics cards:
  NVIDIA GeForce 8 Series Graphics Cards
  NVIDIA Quadro FX 4600
  NVIDIA Quadro FX 5600

Page 4: Traditional GPUs

[Figure; source: Intel Corporation]

Page 5: Traditional GPUs

GPUs talk Polygons

[Figure: graphics pipeline diagram. Box labels: From CPU, Vertex Processor, Pixel Fragmenting, Creation, Process Fragments, Merge Output, Display]

Page 6: Traditional GPUs

OpenGL and DirectX abstract this away.

[Figure: graphics pipeline diagram. Box labels: From CPU, Vertex Processor, Pixel Fragmenting, Creation, Process Fragments, Merge Output, Display]

Page 7: The NVIDIA G80 Architecture

Reconfigurable Processor Pipeline

[Figure; source: NVIDIA]

Page 8: G80 History and Specifications

Project Started in Summer of 2002.
128 Compute Cores
1.35 GHz in the GeForce 8800
Floating Point Ops
Stream Processor Architecture
One Computing Unit Streams into another Computing Unit

Page 9: The CUDA Interface to the G80

Compute Unified Device Architecture
C Interface for Performing Operations on the NVIDIA Processor
Contains traditional C memory semantics with the context of a GPU

Page 10: Working with CUDA

Custom compiler provided to compile C code that the GPU can understand.
The API functions provide a whole host of ways to interface with the GPU.
CUDA Libraries are provided for common tasks.
CUDA Runtime helps management of memory

No DirectX or OpenGL knowledge needed!

Page 11: Working with CUDA

Running C on the CPU    Running C on the GPU
malloc                  cudaMalloc
free                    cudaFree
CPU Code                GPU Code

Pointers on one side stay on one side.
This will create issues for existing applications

Page 12: LAPACK

Linear Algebra PACKage
Implemented in Fortran 77
Interfaces with BLAS (Basic Linear Algebra Subprograms)
Professor James Demmel involved in Project

Page 13: CLAPACK

An F2C’ed version of LAPACK.
Very ugly!

    s_rsle(&io___8);
    do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
    e_rsle();
    if (nm < 1) {
        s_wsfe(&io___10);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;
        fatal = TRUE_;
    } else if (nm > 12) {
        s_wsfe(&io___11);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;

Page 14: CUBLAS

NVIDIA’s CUDA Based Implementation of BLAS
Many functions are similar, but argument signatures are slightly different
Adds some other functions as well
  cublasAlloc
  cublasFree

CUBLAS lives in the GPU world

Page 15: CLAPACK and CUBLAS

Putting them together is not as easy as just linking CLAPACK to CUBLAS.
Matrices and data structures must be moved into GPU memory space.
CLAPACK executes on the CPU.
CUBLAS executes on the GPU.

[Figure: CLAPACK Function -> Memory copy CPU->GPU -> CUBLAS -> Memory copy GPU->CPU]

Page 16: CLAPACK Concentration

General Solve
sgesv
Computes solution to linear system of equations A × X = B
To solve, A is factored into three matrices: P, L, and U.
  P = Permutation Matrix
  L = Lower Triangular
  U = Upper Triangular

Currently, our results cover the triangular factoring step
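
The factoring step described on this slide can be sketched in plain C. This is a minimal, unblocked illustration of LU factorization with partial pivoting, not CLAPACK's code: the function names `lu_factor` and `lu_residual`, the row-major layout, and the `perm` array are assumptions made for this example (LAPACK itself stores matrices column-major and reports pivots differently). The sketch produces the factors in the row-permuted form P·A = L·U.

```c
#include <assert.h>

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Factor the n x n row-major matrix a in place so that P*A = L*U.
 * Afterwards the strict lower triangle of a holds L (its unit diagonal is
 * implied) and the upper triangle holds U. perm[i] is the index of the
 * original row that ended up in row i.
 * Returns 0 on success, or -1 if no nonzero pivot can be found. */
int lu_factor(float *a, int n, int *perm)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        perm[i] = i;

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: largest entry in column k, at or below row k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (absf(a[i * n + k]) > absf(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0f)
            return -1;

        if (p != k) {
            for (int j = 0; j < n; j++) {
                float t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            int tp = perm[k];
            perm[k] = perm[p];
            perm[p] = tp;
        }

        for (int i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];   /* multiplier, stored as L[i][k] */
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Largest absolute entry of P*A - L*U, where orig is the unfactored matrix
 * and lu/perm are the outputs of lu_factor. */
float lu_residual(const float *orig, const float *lu, int n, const int *perm)
{
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int kmax = i < j ? i : j;
            float sum = 0.0f;
            for (int k = 0; k <= kmax; k++) {
                float l = (k == i) ? 1.0f : lu[i * n + k];
                sum += l * lu[k * n + j];
            }
            float diff = absf(orig[perm[i] * n + j] - sum);
            if (diff > worst)
                worst = diff;
        }
    }
    return worst;
}
```

To try it, copy a matrix into a scratch buffer, call `lu_factor` on the copy, and pass the original and the factored copy to `lu_residual`; a value near zero indicates the factors reproduce the row-permuted input. The inner update loop is the part a BLAS-backed implementation would hand to matrix routines.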

Page 17: Performance Results

[Figure not included in transcript]

Page 18: Performance Results

[Figure not included in transcript]

Page 19: Performance Issues

Much copying must be done from the CPU to GPU and GPU to CPU to communicate results.

Why not convert all pointers into GPU pointers?
Requires CLAPACK to run in GPU memory.
Could be someone’s research paper…

Page 20: Other Issues

Floating Point Behaves Differently
Section 5.2 of the CUDA Programming Guide Discusses Deviations from IEEE-754
No support for denormalized numbers
Underflowed numbers are flushed to zero

We noticed some results appearing as 0.0001 instead of 0, for example
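
To make "flushed to zero" concrete, here is a small CPU-side emulation in C. It is only a sketch of the flush-to-zero rule named on this slide, not a model of the G80 hardware and not an explanation of the 0.0001 observation; the helper names `flush_to_zero` and `flush_array` are hypothetical. A value counts as denormalized here when it is nonzero but smaller in magnitude than `FLT_MIN`, the smallest normal single-precision number.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Return x unchanged unless it is a denormalized (subnormal) float,
 * in which case return a zero of the same sign. */
float flush_to_zero(float x)
{
    float mag = x < 0.0f ? -x : x;
    if (mag > 0.0f && mag < FLT_MIN)
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}

/* Apply flush_to_zero to every element of v in place.
 * Returns how many elements were flushed. */
int flush_array(float *v, int n)
{
    assert(n >= 0);
    assert(v != NULL || n == 0);
    int flushed = 0;
    for (int i = 0; i < n; i++) {
        float y = flush_to_zero(v[i]);
        if (y != v[i]) {
            v[i] = y;
            flushed++;
        }
    }
    return flushed;
}
```

Running CPU results through `flush_array` before comparing them with GPU output is one way to separate differences caused by denormal handling from differences caused by other rounding behavior.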

Page 21: Current State

Investigating some interesting memory issues on the GPU side.
Allocations Mysteriously Fail.

Page 22: Conclusions To Date

Small data sets are better left on the CPU.
GPU calculations may not be appropriate for scientific computing depending on needs.
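
A rough break-even estimate shows why small data sets favor the CPU. Everything in this model is an assumption introduced here rather than a measurement from the slides: the matrix is n × n in single precision (4n² bytes), it is copied once to the GPU and once back, LU factorization costs about 2n³/3 floating-point operations, and R_CPU, R_GPU (operations per second) and B (bytes per second across the bus) are sustained rates with R_GPU > R_CPU.

```latex
\begin{align*}
T_{\mathrm{CPU}}(n) &= \frac{2n^{3}}{3\,R_{\mathrm{CPU}}}, \\
T_{\mathrm{GPU}}(n) &= \frac{2n^{3}}{3\,R_{\mathrm{GPU}}} + \frac{8n^{2}}{B}, \\
T_{\mathrm{GPU}}(n) < T_{\mathrm{CPU}}(n)
  &\iff n > \frac{12}{B}\cdot\frac{R_{\mathrm{CPU}}\,R_{\mathrm{GPU}}}{R_{\mathrm{GPU}} - R_{\mathrm{CPU}}}.
\end{align*}
```

Because the copy term grows as n² while the arithmetic grows as n³, the GPU only pays off above a threshold size. If data is copied on every BLAS call rather than once per factorization, the copy term is larger and the threshold moves higher still.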

Page 23: Future Directions

Moving all of LAPACK into GPU
Resolving the copying issue
  Perhaps resolved by unifying the CPU and GPU?

Want to give it a try?
Can’t find Quadro FX 5600 on Market (MSRP $2,999)
GeForce 8 Series have the G80 Processor
  GeForce 8500GT ($99.99)
  GeForce 8800GTX ($939.99)
