IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Transcript
Page 1: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Out-of-Core Programming with NVIDIA’s CUDA

Gene Cooperman, High Performance Computing Lab

College of Computer and Information Science, Northeastern University

Boston, Massachusetts 02115, USA

[email protected]

Page 2: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Pencil and Paper Calculation

• GeForce 8800:

– 16 CPU chips/Streaming Multiprocessors (SMs), 8 cores per chip: 128 cores

– Aggregate bandwidth to off-chip global memory: 86.4 GB/s (optimal)

– Average bandwidth to global memory per core: 0.67 GB/s

• Motherboard

– 4 CPU cores

– About 10 GB/s bandwidth to main RAM

– Average bandwidth to RAM per core: 2.5 GB/s
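
A quick check of this pencil-and-paper arithmetic, as a plain-C sketch (all figures are the ones quoted above):

    #include <stdio.h>

    int main(void) {
        /* GeForce 8800: 16 SMs x 8 cores = 128 cores,
           86.4 GB/s aggregate bandwidth to global memory (optimal). */
        double gpu_bw_gbs = 86.4;
        int    gpu_cores  = 16 * 8;
        /* Motherboard: 4 CPU cores, about 10 GB/s to main RAM. */
        double cpu_bw_gbs = 10.0;
        int    cpu_cores  = 4;

        printf("GPU: %.2f GB/s per core\n", gpu_bw_gbs / gpu_cores); /* ~0.67 */
        printf("CPU: %.2f GB/s per core\n", cpu_bw_gbs / cpu_cores); /* 2.5  */
        return 0;
    }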

Page 3: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Keeping Pipe to Memory Flowing

• Thread block: threads on a single chip

• Thread blocks are organized into warps

• A full warp of 32 threads is required (to minimize the overhead of switching thread blocks)

• Highest bandwidth when all SMs execute the same code
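
These rules show up directly in a kernel's launch configuration. A minimal CUDA sketch (the kernel and sizes are illustrative, not from the lecture): the block size is a multiple of the 32-thread warp, and the grid has enough blocks that all SMs execute the same code:

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
        if (i < n)
            data[i] *= alpha;
    }

    int main(void) {
        int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        /* Block size is a multiple of the 32-thread warp (here 4 warps);
           the grid has far more blocks than SMs, so every SM stays busy
           executing the same kernel code. */
        int threadsPerBlock = 128;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }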

Page 4: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Memory-Bound Computations

• So, how much data can we keep in the SMs before it overflows?

• 16 KB/SM × 16 SMs → 256 KB total cache

• Any computation with an active working set of more than 256 KB risks being memory-bound.
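
On the GeForce 8800 that 16 KB per SM is the on-chip shared memory, so the working-set budget is visible right in a kernel's __shared__ declarations. A minimal sketch (the kernel is a placeholder; only the 16 KB figure comes from the slide):

    __global__ void worker(float *out) {
        /* 4000 floats = 16,000 bytes: nearly the entire 16 KB of
           shared memory on one GeForce 8800 SM.  A kernel whose
           active working set exceeds this must spill to off-chip
           global memory and risks becoming memory-bound. */
        __shared__ float tile[4000];
        tile[threadIdx.x] = 0.0f;       /* placeholder work */
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x];
    }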

Page 5: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Memory Bandwidth in Numbers

[Plot: achieved memory bandwidth. X-axis: number of thread blocks; Y-axis: bandwidth (MB/s); one curve per number of threads per thread block. Thanks to Kapil Arya and Viral Gupta; illustrative of trends only.]

Page 6: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Is Life Any Better Back on the Motherboard?

• Up to 10 GB/s bandwidth to motherboard RAM (perhaps five times slower than NVIDIA in practice)

• Four cores competing for bandwidth

• Cache of at least 1 MB, and possibly much more (e.g., L3 cache)

• Conclusion: Less pressure on memory, but similar order of magnitude

Page 7: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Is Life Any Better between CPU and Disk?

• Between 0.05 GB/s and 0.1 GB/s bandwidth to disk

• Four cores competing for bandwidth

• Cache consists of 4 GB or more of RAM

• Conclusion: huge pressure on memory (but RAM as cache is large)

Page 8: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Our Solution

• Disk is the New RAM

• Bandwidth of Disk: ~100 MB/s

• Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s

• Bandwidth of RAM: approximately 5 GB/s

• Conclusion:

1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly idle disk space, is a good approximation to a shared memory computer with 200 CPU cores and a single subsystem with 25 TB of shared memory. (The arguments also work for a SAN with multiple access nodes, but we consider local disks for simplicity.)

2. The disks of a cluster can serve as if they were RAM.

3. The traditional RAM can then serve as if it were cache.
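
The claim is again simple arithmetic; a plain-C sketch of the comparison, using only the numbers on this slide:

    #include <stdio.h>

    int main(void) {
        double disk_bw_mbs = 100.0;   /* one disk: ~100 MB/s */
        int    ndisks      = 50;      /* one disk per node   */
        double ram_bw_gbs  = 5.0;     /* one RAM subsystem   */

        /* Aggregate disk bandwidth matches a RAM subsystem ... */
        printf("50 disks: %.1f GB/s (RAM: %.1f GB/s)\n",
               ndisks * disk_bw_mbs / 1000.0, ram_bw_gbs);
        /* ... and 50 x 500 GB of mostly idle disk = 25 TB of "RAM". */
        printf("capacity: %.0f TB\n", ndisks * 500.0 / 1000.0);
        return 0;
    }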


Page 10: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

What About Disk Latency?

• Unfortunately, adding 50 disks does not improve the latency.

• So, we re-organize the data structures and low-level algorithms.

• Our group has five years of case histories applying this to computational algebra — but each case requires months of development and debugging.

• We’re now developing both higher-level abstractions for run-time libraries and a language extension that will make future development much faster.

Page 11: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Applications Benefiting from Disk-Based Parallel Computation

    Discipline                    Example Application
    1.  Verification              Symbolic Computation using BDDs
    2.  Verification              Explicit State Verification
    3.  Comp. Group Theory        Search and Enumeration in Mathematical Structures
    4.  Coding Theory             Search for New Codes
    5.  Security                  Exhaustive Search for Passwords
    6.  Semantic Web              RDF query language; OWL Web Ontology Language
    7.  Artificial Intelligence   Planning
    8.  Proteomics                Protein folding via a kinetic network model
    9.  Operations Research       Branch and Bound
    10. Operations Research       Integer Programming (applic. of Branch-and-Bound)
    11. Economics                 Dynamic Programming
    12. Numerical Analysis        ATLAS, PHiPAC, FFTW, and other adaptive software
    13. Engineering               Sensor Data
    14. A.I. Search               Rubik’s Cube

Page 12: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Central Claim

Suppose one had a single computer with 10 terabytes of RAM and 200 CPU cores. Does that satisfy your need for computers with more RAM?

CLAIM: A computer cluster of 32 quad-core nodes, each with a 500 GB local disk, is a good approximation of the above computer. (The arguments also work for a SAN with multiple access nodes, but we discuss local disks for simplicity.)

Page 13: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster like a 10 TB shared memory computer?

• Assume 200 GB/node of free disk space.

• Assume 50 nodes.

• The bandwidth of 50 disks is 50 × 100 MB/s = 5 GB/s.

• The bandwidth of a single RAM subsystem is about 5 GB/s.

CLAIM: You probably have the 10 TB of temporary disk space lying idle on your own recent-model computer cluster. You just didn’t know it. (Or were you just not telling other people about the space, so you could use it for yourself?)

The economics of disks are such that one saves very little by buying less than a 500 GB disk per node. It’s common to buy the 500 GB disk and reserve the extra space for expansion.

Page 14: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster NOT like a 10 TB shared memory computer?

1. We require a parallel program. (We must access the local disks of many cluster nodes in parallel.)

2. The latency problem of disk.

3. Can the network keep up with the disk?

Page 15: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

• ANSWER 1: We’ve used this architecture, and it works for us.

• We’ve developed solutions for a series of algorithmically simple computational kernels from computational algebra — especially mathematical group theory. All of the following computations completed in less than one cluster-week on a cluster of 60 nodes or less.

– Construction of Thompson Sporadic Simple Group (2003): 2 gigabytes (temporary space), 1.4 × 10^8 states, 4 bytes per state

– Construction of Baby Monster Sporadic Simple Group (2006): 6 terabytes (temporary space), 1.4 × 10^10 states, 12 bytes per state

– Condensation of Fi23 Sporadic Simple Group (2007): 400 GB (temporary space), 1.2 × 10^10 states, 30 bytes per state (larger condensation for J4 now in progress)

– Rubik’s Cube: 26 Moves Suffice to Solve Rubik’s Cube (2007): 7 terabytes (temporary space), 10^12 states, 6 bytes per state

– In progress: coset enumeration (pointer-chasing, similar to the algorithm for converting an NFA to a DFA (finite automata)).

Page 16: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster NOT like a 10 TB shared memory computer?

1. We require a parallel program.

2. The latency problem of disk.

3. Can the network keep up with the disk?

Page 17: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

1. We require a parallel program. (We must access the local disks of many nodes in parallel.)

• Our bet (still to be proved): Any sequential algorithm that already creates gigabytes of RAM-based data should have a way to create that data in parallel.

2. The latency problem of disk. Solutions exist (see the sketch after this list):

(a) For duplicates on the frontier in state space search: Delayed Duplicate Detection implies waiting until many nodes of the next frontier (and duplicates from previous iterations) have been discovered. Then remove duplicates.

(b) For hash tables, wait until there are millions of hash queries. Then sort on the hash index, and scan the disk to resolve queries.

(c) For pointer-chasing, wait until millions of pointers are available for chasing. Then sort and scan the disk to dereference pointers.

(d) For tracing strings, with each string being a lookup, wait until millions of strings are available. Then ....

3. Can the network keep up with the disk?
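
A sketch of technique (b) above — and the same sort-and-scan pattern underlies (c) and (d). This is not the lecture's code: the record layout and the assumption that the disk-resident table is kept sorted by hash index are illustrative. Millions of queries are buffered, sorted by hash index, and then resolved in one sequential pass, so disk latency is paid once per batch rather than once per query:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { unsigned long hash; long answer; } Query;
    typedef struct { unsigned long hash; long value;  } Record;

    static int by_hash(const void *a, const void *b) {
        unsigned long x = ((const Query *)a)->hash;
        unsigned long y = ((const Query *)b)->hash;
        return (x > y) - (x < y);
    }

    /* Resolve a batch of queries against a disk-resident table stored
       in hash order: sort the batch, then merge it with ONE sequential
       scan of the table. */
    void resolve_batch(Query *q, size_t nq, FILE *table) {
        qsort(q, nq, sizeof *q, by_hash);
        rewind(table);
        Record r;
        size_t i = 0;
        while (i < nq && fread(&r, sizeof r, 1, table) == 1) {
            while (i < nq && q[i].hash < r.hash)
                q[i++].answer = -1;              /* not in table */
            while (i < nq && q[i].hash == r.hash)
                q[i++].answer = r.value;
        }
        while (i < nq)
            q[i++].answer = -1;                  /* past end of table */
    }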

Page 18: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

When is a cluster NOT like a 10 TB shared memory computer?

. . . and why doesn’t it matter for our purposes?

1. We require a parallel program. (We must access the local disks of many nodes inparallel.)

2. The latency problem of disk.

3. Can the network keep up with the disk? (In our experience to date, the network does keep up. Here are some reasons why it seems to just work.)

• The point-to-point bandwidth of Gigabit Ethernet is about 100 MB/s. The bandwidth of disk is about 100 MB/s. As long as the aggregate bandwidth of the network can keep up, everything is fine.

• Researchers already face the issue of aggregate network bandwidth in RAM-based programs. The disk is slower than RAM. So, probably traditional parallel programs can cope.

Page 19: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Applications from Computational Group Theory (2003–2007)

    Group            Space Size     State Size   Total Storage
    Fischer Fi23     1.17 × 10^10   100 bytes    1 TB
    “Baby Monster”   1.35 × 10^10   548 bytes    7 TB
    Janko J4         1.31 × 10^11   64 bytes     8 TB

(joint with Eric Robinson)

Page 20: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

History of Rubik’s Cube

• Invented in late 1970s in Hungary.

• In 1982, in Cubik Math, Singmaster and Frey conjectured:

No one knows how many moves would be needed for “God’s Algorithm” assuming he always used the fewest moves required to restore the cube. It has been proven that some patterns must exist that require at least seventeen moves to restore, but no one knows what those patterns may be. Experienced group theorists have conjectured that the smallest number of moves which would be sufficient to restore any scrambled pattern — that is, the number of moves required for “God’s Algorithm” — is probably in the low twenties.

• Current Best Guess: 20 moves suffice

– States needing 20 moves are known

Page 21: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

History of Rubik’s Cube (cont.)

• Invented in late 1970s in Hungary.

• 1982: “God’s Number” (the number of moves needed) was known by the authors of the conjecture to be between 17 and 52.

• 1990: C., Finkelstein, and Sarawagi showed 11 moves suffice for Rubik’s 2×2×2 cube (corner cubies only).

• 1995: Reid showed 29 moves suffice (a lower bound of 20 was already known).

• 2006: Radu showed 27 moves suffice.

• 2007: Kunkle and C. showed 26 moves suffice.

• 2008: Rokicki showed 22 moves suffice (using idle resources at Sony Pictures).

Page 22: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Large-Memory Apps: Experience in N.U. Course

(mixed undergrads and grads)

1. Chaitin’s Algorithm

2. Fast Permutation Multiplication

3. Kernighan-Lin Partitioning Algorithm

4. Large matrix-matrix Multiplication

5. Voronoi Diagrams

6. Cellular Automata

7. GAA* Search

8. Static Performance Evaluation for Memory Bound Computing

Others: BFS using External Sort; BFS using Segments & Hash Array; Fast Permutation Multiplication; Kernighan-Lin Partitioning Algorithm; Large matrix-matrix Multiplication; SAT Solver

Page 23: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Example: Rubik’s Cube: Sorting Delayed Duplicate Detection

1. Breadth-first search: storing new frontier (open list) on disk

2. Use bucket sorting to sort and eliminate duplicate states from the new frontier. (The bucket size is chosen to fit in RAM, the new cache.) A single-node sketch follows this list.

3. Storing the new frontier requires 6 terabytes of disk space (and we would use more if we had it). Saving a large new frontier on disk prior to sorting delays duplicate detection, but makes the routine more efficient due to economies of scale.
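
A single-node sketch of step 2, under illustrative assumptions that are not from the lecture: frontier states are fixed-size integers already partitioned into per-bucket files, and each bucket was sized to fit in RAM. The bucket is sorted in memory and duplicates are dropped in one pass:

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned long long State;

    static int cmp_state(const void *a, const void *b) {
        State x = *(const State *)a, y = *(const State *)b;
        return (x > y) - (x < y);
    }

    /* Sort one disk bucket in RAM (the "new cache") and drop duplicate
       states; returns the number of distinct states kept. */
    size_t dedup_bucket(const char *path, State *buf, size_t cap) {
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        size_t n = fread(buf, sizeof(State), cap, f);
        fclose(f);

        qsort(buf, n, sizeof(State), cmp_state);
        size_t out = (n > 0);
        for (size_t i = 1; i < n; i++)      /* keep first of each run */
            if (buf[i] != buf[out - 1])
                buf[out++] = buf[i];

        f = fopen(path, "wb");              /* rewrite the deduped bucket */
        if (f) { fwrite(buf, sizeof(State), out, f); fclose(f); }
        return out;
    }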

Page 24: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Rubik’s Cube: Two-Bit Trick

1. The final representation of the state space (1.4 × 10^12 states) could use only 2 bits per state. (We use 4 bits per state for convenience.)

2. We used mathematical group theory to derive a highly dense, perfect hash function (no collisions) for the states of |cube|/|S|.

3. Our hash function represents symmetrized cosets (the union of all symmetric states of |cube|/|S| under the symmetries of the cube).

4. Each hash slot need only store the level in the search tree modulo 3. This allows the algorithm to distinguish states from the current frontier, the next frontier, and the previous frontier (current level; current level plus one; and current level minus one). This is all that is needed. A sketch of this encoding follows.
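
A small sketch of the 2-bit encoding in item 4 (item 1 notes the real implementation used 4 bits per state for convenience). Everything here other than the mod-3 idea itself — the names, and the assumption of a perfect hash giving each state an index — is illustrative:

    #include <stdint.h>

    #define UNSEEN 3u   /* 2-bit codes: 0,1,2 = search depth mod 3 */

    /* Packed table, 2 bits per state, 4 states per byte; allocated
       elsewhere with every slot initialized to UNSEEN (bytes = 0xFF). */
    static uint8_t *table;

    static unsigned get2(uint64_t i) {
        return (table[i >> 2] >> ((i & 3) * 2)) & 3u;
    }
    static void set2(uint64_t i, unsigned v) {
        unsigned s = (unsigned)(i & 3) * 2;
        table[i >> 2] = (uint8_t)((table[i >> 2] & ~(3u << s)) | (v << s));
    }

    /* While expanding the frontier at depth d, slots holding d % 3 and
       (d - 1) % 3 are the current and previous frontiers; marking a
       newly seen state with (d + 1) % 3 puts it on the next frontier.
       Depth mod 3 is all the search needs. */
    int visit(uint64_t state_index, unsigned depth) {
        if (get2(state_index) != UNSEEN)
            return 0;                       /* already on some frontier */
        set2(state_index, (depth + 1) % 3);
        return 1;                           /* newly discovered */
    }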

Page 25: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

Space-Time Tradeoffs using Additional Disk

• Use even more disk space in order to speed up the algorithm.

“A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, Eric Robinson, Daniel Kunkle and Gene Cooperman, Proc. of 2007 International Workshop on Parallel Symbolic and Algebraic Computation (PASCO ’07), ACM Press, 2007, pp. 78–87.

Page 26: IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)

LONGER-TERM GOAL: Mini-Language Extension

Well-understood building blocks already exist: external sorting, B-trees, Bloom filters, Delayed Duplicate Detection, Distributed Hash Trees (DHT), and some still more exotic algorithms.

GOAL: Provide language extensions for common data structures and algorithms (including breadth-first search) that invoke a run-time library. Design the language to bias the programmer toward efficient use of disk.

ROOMY LANGUAGE:
New parallel disk-based language, Roomy, in development by Daniel Kunkle.
Implementation: run-time C library with #define and typedef for nicer syntax.
The language appears to be sequential; the back-end runs on a cluster with local disks, a cluster with a SAN, or a single computer using RAM (for simpler development and debugging).
Expected availability: mid-2009