Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford PPL Retreat, November 21, 2008

Transcript
Page 1:

Many-Core Programming with GRAMPS

Jeremy Sugerman
Stanford PPL Retreat
November 21, 2008

Page 2:

Introduction
Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
Initial work appearing in ACM TOG in January 2009

Our starting point: CPU and GPU trends… and collision? Two research areas:
– HW/SW Interface, Programming Model
– Future Graphics API

Page 3:

Background
Problem Statement / Requirements: Build a programming model / primitives / building blocks to drive efficient development for, and usage of, future many-core machines. Handle homogeneous and heterogeneous machines, programmable cores, and fixed-function units.

Status Quo:
– GPU Pipeline (good for GL, otherwise hard)
– CPU / C run-time (no guidance; fast is hard)

Page 4:

GRAMPS
Apps: graphs of stages and queues
Producer-consumer, task, and data parallelism
Initial focus on real-time rendering

[Diagram: two example GRAMPS graphs. Ray Tracer: Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend. Raster Graphics: Rasterize → Shade → Fragment Queue → FB Blend, with an Input Fragment Queue and Output. Legend: thread stages, shader stages, fixed-function stages; queues, stage outputs.]
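A graph of stages and queues like the ones above can be mimicked in a few lines. This is not the real GRAMPS API (which is C-based); `PipelineGraph`, `stage`, and `run` are hypothetical names in a minimal Python sketch of the "stages connected by queues" idea:

```python
from collections import deque

class PipelineGraph:
    """Hypothetical sketch of a GRAMPS-style graph: stages joined by queues.
    Illustrative only, not the real GRAMPS API."""

    def __init__(self):
        self.stages = []  # list of (name, fn); fn maps one packet -> list of packets

    def stage(self, name, fn):
        self.stages.append((name, fn))
        return self

    def run(self, inputs):
        q = deque(inputs)              # input queue of the first stage
        for name, fn in self.stages:
            out = deque()
            while q:                   # drain this stage's input queue
                out.extend(fn(q.popleft()))
            q = out                    # becomes the next stage's input queue
        return list(q)

# A toy graph shaped like the slide's ray tracer: Camera -> Intersect -> Shade -> FB Blend.
g = (PipelineGraph()
     .stage("Camera",    lambda px: [("ray", px)])
     .stage("Intersect", lambda ray: [("hit", ray[1])])
     .stage("Shade",     lambda hit: [hit[1] * 2])   # stand-in shading math
     .stage("FBBlend",   lambda frag: [frag]))
print(g.run([1, 2, 3]))   # -> [2, 4, 6]
```

A real implementation schedules stages concurrently; this sketch runs them to completion one at a time purely to show the queue plumbing.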

Page 5:

Design Goals
Large Application Scope – preferable to roll-your-own
High Performance – competitive with roll-your-own
Optimized Implementations – informs HW design
Multi-Platform – suits a variety of many-core systems

Also: Tunable – expert users can optimize their apps

Page 6:

As a Graphics Evolution
Not (unthinkably) radical for ‘graphics’; like fixed → programmable shading
– Pipeline undergoing massive shake-up
– Diversity of new parameters and use cases

Bigger picture than ‘graphics’
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some ‘GPUs’ are losing their innate pipeline

Page 7:

As a Compute Evolution (1)
Sounds like streaming: execution graphs, kernels, data parallelism

Streaming: “squeeze out every FLOP”
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time

Page 8:

As a Compute Evolution (2)
GRAMPS: “interesting apps are irregular”
– Goals: dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms

Streaming techniques fit naturally when applicable

Page 9:

GRAMPS’ Role
A ‘graphics pipeline’ is now an app!
Target users: engine/pipeline/run-time authors, savvy hardware-aware systems developers.

Compared to the status quo:
– More flexible and lower level than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific

Page 10:

GRAMPS Entities (1)
Data access via windows into queues/memory

Queues: dynamically allocated / managed
– Ordered or unordered
– Specified max capacity (could also spill)
– Two types: Opaque and Collection

Buffers: random access, pre-allocated
– RO, RW Private, RW Shared (not supported)
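The "windows into queues" idea, combined with a specified max capacity, can be sketched as a reserve/commit protocol. The class and method names below are hypothetical, not the real GRAMPS primitives:

```python
from collections import deque

class BoundedPacketQueue:
    """Sketch of a GRAMPS-style queue: fixed packet size, max capacity,
    producer reserve/commit windows. Illustrative only, not the real API."""

    def __init__(self, packet_size, max_packets):
        self.packet_size = packet_size
        self.max_packets = max_packets
        self.packets = deque()

    def reserve(self):
        # A producer asks for an output window. Failing (None) when full is
        # the run-time's cue that the producer should be pre-empted.
        if len(self.packets) >= self.max_packets:
            return None
        return [None] * self.packet_size

    def commit(self, window):
        assert len(window) == self.packet_size
        self.packets.append(window)

    def pop(self):
        # A consumer takes an input window (in order, for an ordered queue).
        return self.packets.popleft() if self.packets else None

q = BoundedPacketQueue(packet_size=4, max_packets=2)
w = q.reserve(); w[:] = [0, 1, 2, 3]; q.commit(w)
w = q.reserve(); w[:] = [4, 5, 6, 7]; q.commit(w)
print(q.reserve())   # -> None (at capacity: producer must yield)
print(q.pop())       # -> [0, 1, 2, 3]
```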

Page 11:

GRAMPS Entities (2)
Queue Sets: independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
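The "instanced parallelism plus mutual exclusion" combination can be sketched as sub-queues handed out whole, one consumer instance per sub-queue at a time. Names here are illustrative, not the real API:

```python
from collections import deque

class QueueSet:
    """Sketch of a GRAMPS queue set: independent sub-queues, each drained by
    at most one consumer instance at a time. Illustrative, not the real API."""

    def __init__(self, num_subqueues):
        self.sub = [deque() for _ in range(num_subqueues)]
        self.busy = [False] * num_subqueues   # per-sub-queue exclusion flag

    def push(self, key, item):
        self.sub[key % len(self.sub)].append(item)

    def acquire(self):
        # Hand a whole sub-queue to one consumer instance. Other non-empty
        # sub-queues stay available, so instances run in parallel without
        # ever sharing a sub-queue (the mutual-exclusion half of the deal).
        for i, q in enumerate(self.sub):
            if q and not self.busy[i]:
                self.busy[i] = True
                return i
        return None

    def release(self, i):
        self.busy[i] = False

qs = QueueSet(4)
for frag in range(8):          # e.g. fragments bucketed by screen tile
    qs.push(frag, frag)
a = qs.acquire(); b = qs.acquire()
print(a, b)                    # two different sub-queues -> two parallel instances
```

Faking this with plain multiple queues would push the bucketing and locking into every producer and consumer, which is the point the slide makes.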

Page 12:

Design Goals (Reminder)
Large Application Scope
High Performance
Optimized Implementations
Multi-Platform
(Tunable)

Page 13:

What We’ve Built (System)

Page 14:

GRAMPS Scheduling
Static inputs:
– Application graph topology
– Per-queue packet (‘chunk’) size
– Per-queue maximum depth / high-watermark

Dynamic inputs (currently ignored):
– Current per-queue depths
– Average execution time per input packet

Simple policy: run consumers, pre-empt producers
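The "run consumers, pre-empt producers" policy can be sketched as a downstream-first scan: any runnable consumer beats the producers feeding it, so queues drain before they grow. This is an illustrative reading of the slide, not the actual scheduler:

```python
def pick_next_stage(stages, queue_depth):
    """Sketch of 'run consumers, pre-empt producers'. `stages` is ordered
    producer-to-consumer; `queue_depth[s]` is the depth of stage s's input
    queue. Illustrative only, not the real GRAMPS scheduler."""
    # Scan from the most-downstream stage: a runnable consumer always wins,
    # which keeps queue depths (and so memory footprint) bounded.
    for s in reversed(stages):
        if queue_depth[s] > 0:
            return s
    return stages[0]            # nothing queued: run the source stage

stages = ["Camera", "Intersect", "Shade", "FBBlend"]
depths = {"Camera": 0, "Intersect": 3, "Shade": 2, "FBBlend": 0}
print(pick_next_stage(stages, depths))   # -> "Shade", not "Intersect"
```

With the dynamic inputs the slide lists (current depths, per-packet execution time), the same scan could weigh stages instead of taking the first hit.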

Page 15:

GRAMPS Scheduler Organization
Tiered scheduler: Tier-N, Tier-1, Tier-0
Tier-N only wakes idle units, no rebalancing
All Tier-1s compete for all queued work.

‘Fat’ cores: software Tier-1 per core, Tier-0 per hardware thread
‘Micro’ cores: single shared hardware Tier-1+0

Page 16:

What We’ve Built (Apps)

Direct3D Pipeline (with Ray-tracing Extension)
[Diagram: stages IA 1…IA N, VS 1…VS N, RO, Rast, PS, OM, plus Trace and PS2 for the ray-tracing extension; connected by Input Vertex Queues 1…N, Primitive Queues 1…N, a Sample Queue Set, a Fragment Queue, a Ray Queue, and a Ray Hit Queue.]

Ray-tracing Graph
[Diagram: stages Camera, Sampler, Intersect, Shade, FB Blend, and Tiler; connected by Tile, Sample, Ray, Ray Hit, and Fragment Queues. Legend: thread stages, shader stages, fixed-function stages; queues, stage outputs, push outputs.]

Page 17:

Initial Renderer Results
Queues are small (< 600 KB CPU, < 1.5 MB GPU)
Parallelism is good (at least 80%; all but one configuration 95+%)

Page 18:

Scheduling Can Clearly Improve

Page 19:

Taking Stock: High-level Questions
Is GRAMPS a suitable GPU evolution?
– Does it enable a pipeline competitive with bare metal?
– Does it enable innovation: advanced / alternative methods?

Is GRAMPS a good parallel compute model?
– Does it fulfill our design goals?

Page 20:

Possible Next Steps
Simulation / hardware fidelity improvements
– Memory model, locality
GRAMPS run-time improvements
– Scheduling, run-time overheads
GRAMPS API extensions
– On-the-fly graph modification, data sharing
More applications / workloads
– REYES, physics, finance, AI, …
– Lazy/adaptive/procedural data generation

Page 21:

Design Goals (Revisited)
Application Scope: okay – only (multiple) renderers so far
High Performance: so-so – limited simulation detail
Optimized Implementations: good
Multi-Platform: good
(Tunable: good, but that’s a separate talk)

Strategy: broaden the available apps and use them to drive performance and simulation work for now.

Page 22:

Digression: Some Kinds of Parallelism

Task (Divide) and Data (Conquer)
– Subdivide the algorithm into a DAG (or graph) of kernels.
– Data is long-lived, manipulated in place.
– Kernels are ephemeral and stateless.
– Kernels only get input at entry/creation.

Producer-Consumer (Pipeline) Parallelism
– Data is ephemeral: processed as it is generated.
– Bandwidth or storage costs prohibit accumulation.
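The producer-consumer case above can be shown with two threads and a small bounded queue: the data is ephemeral because the tiny queue forces it to be consumed as it is generated rather than accumulated. A minimal sketch using Python's standard library:

```python
import threading
import queue

# Bounded queue: maxsize=4 means the producer's output can never pile up,
# mirroring "bandwidth or storage costs prohibit accumulation".
q = queue.Queue(maxsize=4)
results = []

def producer(n):
    for i in range(n):
        q.put(i * i)        # blocks whenever the consumer falls behind
    q.put(None)             # end-of-stream marker

def consumer():
    while (item := q.get()) is not None:
        results.append(item + 1)   # process each item as it is generated

t1 = threading.Thread(target=producer, args=(8,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)              # -> [1, 2, 5, 10, 17, 26, 37, 50]
```

In the task/data style, by contrast, the producer would finish writing all n values into long-lived storage before the consumer kernel was even created.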

Page 23:

Three New Graphs
“App” 1: MapReduce Run-time
– Popular parallelism-rich idiom
– Enables a variety of useful apps

App 2: Cloth Simulation (Rendering Physics)
– Inspired by the PhysBAM cloth simulation
– Demonstrates basic mechanics, collision detection
– Graph is still very much a work in progress…

App 3: Real-time REYES-like Renderer (Kayvon)

Page 24:

MapReduce: Specific Flavour
“ProduceReduce”: minimal simplifications / constraints
– Produce/Split (1:N)
– Map (1:N)
– (Optional) Combine (N:1)
– Reduce (N:M, where M << N, or often M = 1)
– Sort (N:N conceptually; implementations vary)

(Aside: REYES is MapReduce, OpenGL is MapCombine)
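The ProduceReduce shape above can be sketched end to end on a toy word count. The stage names are descriptive labels for the slide's ratios, not a real API:

```python
from collections import Counter
from itertools import groupby

def produce(text):                 # Produce/Split (1:N): input -> chunks
    return text.split("\n")

def map_(chunk):                   # Map (1:N): chunk -> (key, value) tuples
    return [(w, 1) for w in chunk.split()]

def combine(tuples):               # Combine (N:1): pre-aggregate one chunk
    c = Counter()
    for k, v in tuples:
        c[k] += v
    return list(c.items())

def produce_reduce(text):
    intermediate = []
    for chunk in produce(text):
        intermediate += combine(map_(chunk))
    intermediate.sort()            # Sort: bring equal keys together
    # Reduce (N:M): one output tuple per distinct key
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(intermediate, key=lambda t: t[0])}

print(produce_reduce("a b a\nb a"))   # -> {'a': 3, 'b': 2}
```

In the GRAMPS version each of these functions would be a stage in the graph, with tuple queues between them instead of Python lists.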

Page 25:

MapReduce Graph
– Map output is a dynamically instanced queue set.
– Combine might motivate a formal reduction shader.
– Reduce is an (automatically) instanced thread stage.
– Sort may actually be parallelized.

[Diagram: Produce → Map → (Optional) Combine → Reduce → Sort, with Initial Tuples, Intermediate Tuples, and Final Tuples queues between stages and an Output. Legend: thread stages, shader stages; queues, stage outputs, push outputs.]

Page 26:

Cloth Simulation Graph
– Update is not producer-consumer!
– Broad Phase will actually be either a (weird) shader or multiple thread instances.
– Fast Recollide details are also TBD.

[Diagram: an Update Mesh / Resolve loop (Proposed Update, Resolution) coupled to Collision Detection: Broad Collide → Narrow Collide → Fast Recollide, with BVH Nodes, Moved Nodes, Candidate Pairs, and Collisions queues. Legend: thread stages, shader stages; queues, stage outputs, push outputs.]

Page 27:

That’s All, Folks
Thank you for listening. Any questions?
Actively interested in new collaborators:
– Owners or experts in some application domain (or engine / run-time system / middleware).
– Anyone interested in scheduling or details of possible hardware / core configurations.

TOG Paper: http://graphics.stanford.edu/papers/gramps-tog/

Page 28:

Backup Slides / More Details

Page 29:

Designing A Good Graph
Efficiency requires “large chunks of coherent work”

Stages separate coherency boundaries
– Frequency of computation (fan-out / fan-in)
– Memory access coherency
– Execution coherency

Queues allow repacking and re-sorting of work from one coherency regime to another.
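The repacking role of queues can be sketched directly: gather incoherent items, re-sort them by a coherency key, and emit full packets of like work for the next stage. This is an illustration of the idea, not a GRAMPS primitive:

```python
from collections import defaultdict

def repack(items, key, packet_size):
    """Re-sort mixed work into coherent packets. The coherency key (e.g. a
    material or shader ID) is whatever the downstream stage wants batched;
    names and shapes here are illustrative only."""
    buckets = defaultdict(list)
    for it in items:
        buckets[key(it)].append(it)           # group like work together
    packets = []
    for k in sorted(buckets):                 # deterministic order for the demo
        b = buckets[k]
        for i in range(0, len(b), packet_size):
            packets.append(b[i:i + packet_size])
    return packets

# Ray hits arrive in scrambled material order; shading wants packets that
# each touch a single material so its execution stays coherent.
hits = [("wood", 3), ("metal", 1), ("wood", 7), ("metal", 5), ("wood", 2)]
print(repack(hits, key=lambda h: h[0], packet_size=2))
# -> [[('metal', 1), ('metal', 5)], [('wood', 3), ('wood', 7)], [('wood', 2)]]
```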

Page 30:

GRAMPS Interfaces
Host/Setup: create the execution graph
Thread: stateful, singleton
Shader: data-parallel, auto-instanced
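The contrast between the two stage flavours can be sketched as follows; the class and function names are hypothetical stand-ins, not the real (C-based) GRAMPS interfaces:

```python
class TilerThreadStage:
    """Thread stage: one long-lived, stateful instance (a singleton).
    Illustrative only."""
    def __init__(self):
        self.tiles_emitted = 0              # state persists across packets
    def run(self, packet):
        self.tiles_emitted += len(packet)
        return [("tile", x) for x in packet]

def shade(element):
    """Shader stage: a pure per-element kernel. Because it keeps no state,
    the run-time is free to auto-instance it once per element, in any
    order, on any core."""
    return element[1] * 2

tiler = TilerThreadStage()
tiles = tiler.run([1, 2, 3])
shaded = [shade(t) for t in tiles]          # stand-in for auto-instancing
print(tiler.tiles_emitted, shaded)          # -> 3 [2, 4, 6]
```

The statefulness is exactly what forces the thread stage to be a singleton, while the shader's statelessness is what buys the scheduler its freedom.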

Page 31:

GRAMPS Graph Portability
Portability really means performance.

Less portable than GL/D3D
– The GRAMPS graph is (more) hardware sensitive

More portable than bare metal
– Enforces modularity
– Best case: it just works
– Worst case: it saves boilerplate

Page 32:

Possible Next Steps: Implementation
Better scheduling
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better

More detailed costs
– Bill for scheduling decisions
– Bill for (internal) synchronization

More statistics

Page 33:

Possible Next Steps: API
Important: graph modification (state change)
Probably: data sharing / ref-counting
Maybe: blocking inter-stage calls (join)
Maybe: intra-/inter-stage synchronization primitives

Page 34:

Possible Next Steps: New Workloads
REYES, hybrid graphics pipelines
Image / video processing
Game physics
– Collision detection or particles
Physics and scientific simulation
AI, finance, sort, search or database query, …

Heavy dynamic data manipulation
– k-D tree / octree / BVH build
– Lazy/adaptive/procedural tree or geometry generation