Go Wrap up Parallel Architectures — Chris Rossbach, cs378 Fall 2018, 10/15/2018

Transcript
Page 1:

Go Wrap up
Parallel Architectures

Chris Rossbach

cs378 Fall 2018

10/15/2018

Page 2:

Outline for Today

• Questions?
• Administrivia
• Agenda:
    • Go
    • Parallel Architectures (GPU background)
• Rob Pike's 2012 Go presentation is excellent, and I borrowed from it: https://talks.golang.org/2012/concurrency.slide

Page 3:

Faux Quiz questions

• How are promises and futures similar to, or different from, goroutines?

• What is the difference between a goroutine and a thread?

• What is the difference between a channel and a lock?

• How is a channel different from a concurrent FIFO?

• What is the CSP model?

• What are the tradeoffs between explicit vs implicit naming in message passing?

• What are the tradeoffs between blocking vs. non-blocking send/receive in a shared memory environment? In a distributed one?

• What is hardware multi-threading; what problem does it solve?

• What is the difference between a vector processor and a scalar?

• Implement a parallel scan or reduction

• How are GPU workloads different from GPGPU workloads?

• How does SIMD differ from SIMT?

• List and describe some pros and cons of vector/SIMD architectures.

• GPUs historically have elided cache coherence. Why? What impact does this have on the programmer?

• List some ways that GPUs use concurrency but not necessarily parallelism.

Page 4:

Google Search

• Workload:
    • Accept query
    • Return page of results (with, ugh, ads)
    • Get search results by sending the query to: Web Search, Image Search, YouTube, Maps, News, etc.
• How to implement this?

Page 5:

Search 1.0

• Google function takes query and returns a slice of results (strings)

• Invokes Web, Image, Video search serially
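The slide's code is not captured in the transcript. A minimal sketch in the spirit of Rob Pike's talk, with a made-up `fakeSearch` helper and `Result`/`Search` types standing in for the real backends:

```go
package main

import (
	"fmt"
	"time"
)

type Result string
type Search func(query string) Result

// fakeSearch builds a stand-in backend with simulated latency.
func fakeSearch(kind string) Search {
	return func(query string) Result {
		time.Sleep(10 * time.Millisecond) // pretend server latency
		return Result(kind + " result for " + query)
	}
}

var (
	Web   = fakeSearch("web")
	Image = fakeSearch("image")
	Video = fakeSearch("video")
)

// Google 1.0: invoke Web, Image, and Video search serially.
func Google(query string) []Result {
	var results []Result
	results = append(results, Web(query))
	results = append(results, Image(query))
	results = append(results, Video(query))
	return results
}

func main() {
	start := time.Now()
	fmt.Println(Google("golang"))
	fmt.Println("elapsed:", time.Since(start)) // roughly the sum of the three latencies
}
```

Because the calls are serial, total latency is the sum of the backends' latencies, which motivates the concurrent versions that follow.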

Page 6:

Search 2.0

• Run Web, Image, Video searches concurrently, wait for results

• No locks, conditions, callbacks
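A hedged sketch of what this version could look like, reusing the same made-up `fakeSearch` stand-ins: the serial calls become goroutines, and a single channel collects the results.

```go
package main

import (
	"fmt"
	"time"
)

type Result string
type Search func(query string) Result

// fakeSearch builds a stand-in backend with simulated latency.
func fakeSearch(kind string) Search {
	return func(query string) Result {
		time.Sleep(10 * time.Millisecond)
		return Result(kind + " result for " + query)
	}
}

var (
	Web   = fakeSearch("web")
	Image = fakeSearch("image")
	Video = fakeSearch("video")
)

// Google 2.0: launch all three searches concurrently and collect the
// results from one channel. No locks, conditions, or callbacks.
func Google(query string) []Result {
	c := make(chan Result)
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()

	var results []Result
	for i := 0; i < 3; i++ {
		results = append(results, <-c)
	}
	return results
}

func main() {
	start := time.Now()
	fmt.Println(Google("golang"))
	fmt.Println("elapsed:", time.Since(start)) // ~one backend latency, not three
}
```

The channel does double duty as the communication path and the synchronization point, which is why no explicit locking is needed.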

Page 7:

Search 2.1

• Don’t wait for slow servers: No locks, conditions, callbacks!
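One plausible sketch of "don't wait for slow servers", again with invented `fakeSearch` backends (one deliberately slow): a `select` races the result channel against a timeout. Buffering the channel lets a late goroutine deposit its result and exit rather than leak.

```go
package main

import (
	"fmt"
	"time"
)

type Result string
type Search func(query string) Result

func fakeSearch(kind string, latency time.Duration) Search {
	return func(query string) Result {
		time.Sleep(latency)
		return Result(kind + " result for " + query)
	}
}

// Video is deliberately slow so the timeout has something to cut off.
var (
	Web   = fakeSearch("web", 10*time.Millisecond)
	Image = fakeSearch("image", 10*time.Millisecond)
	Video = fakeSearch("video", 500*time.Millisecond)
)

// Google 2.1: return whatever has arrived by the deadline.
func Google(query string) []Result {
	c := make(chan Result, 3) // buffered: a late sender won't block forever
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()

	timeout := time.After(100 * time.Millisecond)
	var results []Result
	for i := 0; i < 3; i++ {
		select {
		case r := <-c:
			results = append(results, r)
		case <-timeout:
			return results // whoever missed the deadline is dropped
		}
	}
	return results
}

func main() {
	fmt.Println(Google("golang")) // web and image make it; video does not
}
```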

Page 8:

Search 3.0

• Reduce tail latency with replication. No locks, conditions, callbacks!

Page 9:

Go: magic? …or threadpools and concurrent queues?

• We've seen several abstractions for:
    • control flow/execution
    • communication
• Lots of discussion of pros and cons
• Ultimately still CPUs + instructions
• Go: just sweeping issues under the language interface?
    • Why is it OK to have 100,000s of goroutines?
    • Why isn't composition an issue?

Page 10:

Go implementation details

• M = "machine" (an OS thread)
• P = (processing) context
• G = goroutine
• Each 'P' has a queue of runnable goroutines
• Goroutine scheduling is cooperative: a goroutine is switched out when it completes or blocks
• Very lightweight (fibers!)
• Scheduler does work-stealing

Page 22:

1000s of goroutines?

    func testQ(consumers int) {
        startTimes["testQ"] = time.Now()
        var wg sync.WaitGroup
        wg.Add(consumers)
        ch := make(chan int)
        for i := 0; i < consumers; i++ {
            go func(id int) {
                aval, amore := <-ch
                if amore {
                    info("reader #%d got %d value\n", id, aval)
                } else {
                    info("channel reader #%d terminated with nothing.\n", id)
                }
                wg.Done()
            }(i)
        }
        time.Sleep(1000 * time.Millisecond)
        close(ch)
        wg.Wait()
        stopTimes["testQ"] = time.Now()
    }

• Creates a channel
• Creates "consumers" goroutines
• Each of them tries to read from the channel
• Main either:
    • sleeps for 1 second, then closes the channel, or
    • sends "consumers" values

Page 25:

Channel implementation

• You can just read it: https://golang.org/src/runtime/chan.go
• Some highlights:
    • Race detection built in (cool!)
    • Fast path: the sender writes directly to the receiver's stack
    • A channel often has no capacity, which doubles as a scheduler hint!
    • The buffered channel implementation is fairly standard
• Transputers did this in hardware in the 90s, btw.
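The two cases in the highlights can be seen from the program's side. A small sketch: a buffered channel accepts sends up to its capacity with no receiver present (the "fairly standard" queue case), while an unbuffered send is a rendezvous, which is where the write-to-receiver-stack fast path and the scheduling hint come from.

```go
package main

import "fmt"

func main() {
	// Buffered channel: sends succeed without a waiting receiver,
	// up to the channel's capacity. A third send here would block.
	b := make(chan int, 2)
	b <- 1
	b <- 2
	fmt.Println(<-b, <-b) // FIFO: 1 2

	// Unbuffered channel: a send blocks until a receiver is ready.
	// The runtime can copy the value straight to the receiver and
	// hand execution over, so zero capacity acts as a scheduler hint.
	u := make(chan string)
	go func() { u <- "hello" }() // parks until main receives
	fmt.Println(<-u)
}
```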

Page 36:

A modern GPU: Volta V100

• 80 SMs
    • SM = Streaming Multiprocessor (also called: CU or ACE)
    • 64 cores/SM
    • 5120 threads!
    • 15.7 TFLOPS (roughly: all of k-means, 1,000s of times/sec)
• 640 Tensor cores
• HBM2 memory
    • 4096-bit bus
    • No cache coherence!
• 16 GB memory
    • PCIe-attached

How do you program a machine like this? pthread_create()?

Page 50:

GPUs: Outline

• Background from many areas
    • Architecture: vector processors, hardware multi-threading
    • Graphics: graphics pipeline, graphics programming models
    • Algorithms: parallel architectures → parallel algorithms
• Programming GPUs
    • CUDA
    • Basics: getting something working
    • Advanced: making it perform

Page 51:

Architecture Review: Pipelines

Processor algorithm:

    main() {
        while(true)
            do_next_instruction();
    }

    do_next_instruction() {
        instruction = fetch();
        ops, regs = decode(instruction);
        execute_calc_addrs(ops, regs);
        access_memory(ops, regs);
        write_back(regs);
    }

The pipelined version runs each stage as its own thread, connected by queues:

    main() {
        pthread_create(do_instructions);
        pthread_create(do_decode);
        pthread_create(do_execute);
        ...
        pthread_join(...);
        ...
    }

    do_instructions() {
        while(true) {
            instruction = fetch();
            enqueue(DECODE, instruction);
        }
    }

    do_decode() {
        while(true) {
            instruction = dequeue();
            ops, regs = decode(instruction);
            enqueue(EX, instruction);
        }
    }

    do_execute() {
        while(true) {
            instruction = dequeue();
            execute_calc_addrs(ops, regs);
            enqueue(MEM, instruction);
        }
    }

    ...

What is the name of this kind of parallelism?
Works well if the pipeline is kept full. What kinds of things cause "bubbles"/stalls?
How can we get *more* parallelism?
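The threads-and-queues pseudocode above maps naturally onto goroutines and channels. A toy sketch (the `insn` struct and stage bodies are invented stand-ins, not real decode/execute logic): each stage is a goroutine, each inter-stage queue is a channel, and instructions flow through in order.

```go
package main

import "fmt"

// insn is a toy "instruction" flowing through the pipeline.
type insn struct {
	pc      int
	decoded bool
	result  int
}

// fetch emits n instructions, then closes its output queue.
func fetch(n int) <-chan insn {
	out := make(chan insn)
	go func() {
		for pc := 0; pc < n; pc++ {
			out <- insn{pc: pc}
		}
		close(out)
	}()
	return out
}

// decode marks each instruction decoded and forwards it.
func decode(in <-chan insn) <-chan insn {
	out := make(chan insn)
	go func() {
		for i := range in {
			i.decoded = true
			out <- i
		}
		close(out)
	}()
	return out
}

// execute performs a stand-in "computation" on each instruction.
func execute(in <-chan insn) <-chan insn {
	out := make(chan insn)
	go func() {
		for i := range in {
			i.result = i.pc * 2
			out <- i
		}
		close(out)
	}()
	return out
}

func main() {
	// All three stages run concurrently; while one instruction is
	// executing, the next is decoding and a third is being fetched.
	for i := range execute(decode(fetch(5))) {
		fmt.Printf("pc=%d decoded=%v result=%d\n", i.pc, i.decoded, i.result)
	}
}
```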

Page 64:

Multi-core/SMPs

    main() {
        for(i=0; i<CORES; i++) {
            pthread_create(do_instructions);
        }
    }

    do_instructions() {
        while(true) {
            instruction = fetch();
            ops, regs = decode(instruction);
            execute_calc_addrs(ops, regs);
            access_memory(ops, regs);
            write_back(regs);
        }
    }

• Pros: simple
• Cons: the programmer has to find the parallelism!

Other techniques extract parallelism here: they try to let the machine find the parallelism.
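The "cons" bullet is the crux: on an SMP, nothing finds the parallelism for you. A hand-parallelized sum is the canonical illustration (this sketch and its chunking scheme are mine, not from the slides): the programmer decides how to split the data across workers and how to avoid sharing.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits xs into one contiguous chunk per worker.
// The programmer, not the hardware, decides where the parallelism is.
func parallelSum(xs []int, workers int) int {
	partial := make([]int, workers) // one slot per worker: no sharing
	var wg sync.WaitGroup
	chunk := (len(xs) + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(xs) {
			lo = len(xs)
		}
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, x := range xs[lo:hi] {
				partial[w] += x // each worker writes only its own slot
			}
		}(w, lo, hi)
	}
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	xs := make([]int, 1000)
	for i := range xs {
		xs[i] = i + 1
	}
	fmt.Println(parallelSum(xs, runtime.NumCPU())) // 500500
}
```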

Page 72:

Superscalar processors

Remove the extra instruction streams:

    main() {
        for(i=0; i<CORES; i++)
            pthread_create(decode_exec);
        while(true) {
            instruction = fetch();
            enqueue(instruction);
        }
    }

    decode_exec() {
        instruction = dequeue();
        ops, regs = decode(instruction);
        execute_calc_addrs(ops, regs);
        access_memory(ops, regs);
        write_back(regs);
    }

Doesn't look that different, does it? Why do it?
It enables *independent* instruction-level parallelism.


Page 83: Vector/SIMD processors

Why decode the same instruction sequence over and over?


Page 87: Vector/SIMD processors

main() {
  for(i=0; i<CORES; i++)
    pthread_create(exec);
  while(true) {
    ops, regs = fetch_decode();
    enqueue(ops, regs);
  }
}

exec() {
  ops, regs = dequeue();
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}

Single instruction stream, multiple computations

But now all my instructions need multiple operands!


Page 90: Vector Processors

• Process multiple data elements simultaneously.
• Common in supercomputers of the 1970s, ’80s, and ’90s.
• Modern CPUs support some vector processing instructions
  • Usually called SIMD
• Can operate on a few vector elements per clock cycle in a pipeline, or (SIMD) on all of them per clock cycle
• 1962: University of Illinois Illiac IV (completed 1972): 64 ALUs, 100-150 MFlops
• 1973: TI’s Advanced Scientific Computer (ASC): 20-80 MFlops
• 1975: Cray-1: first to have vector registers instead of keeping data in memory

Single instruction stream, multiple data. The programming model has to change.


Page 94: Vector Processors

Implementation:
• Instruction fetch/control logic is shared
• The same instruction stream is executed on
  • Multiple pipelines
  • Multiple different operands in parallel

GPUs: same basic idea


Page 97: When does vector processing help?

What are the potential bottlenecks here? When can it improve throughput?

Only helps if memory can keep the pipeline busy!


Page 104: Hardware multi-threading

• Addresses the memory bottleneck
• Shares the execution unit across instruction streams
  • Switch on stalls
• Looks like multiple cores to the OS
• Three variants:
  • Coarse-grain
  • Fine-grain
  • Simultaneous

Page 105: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web

Running example

Thread A Thread B Thread C Thread D

• Colors pipeline full• White stall


Page 111: Coarse-grained multithreading

• A single thread runs until a costly stall
  • E.g., a 2nd-level cache miss
• Another thread starts during the stall
  • Pipeline fill time requires several cycles!
• Does not cover short stalls
• Hardware support required
  • PC and register file for each thread
  • Little other hardware
• Looks like another physical CPU to OS/software



Page 118: Fine-grained multithreading

• Threads interleave instructions
  • Round-robin
  • Skip stalled threads
• Hardware support required
  • Separate PC and register file per thread
  • Hardware to control the alternating pattern
• Naturally hides delays
  • Data hazards, cache misses
  • Pipeline runs with rare stalls
• Doesn’t make full use of multi-issue



Page 124: Simultaneous Multithreading (SMT)

• Instructions from multiple threads issued on the same cycle
  • Uses the register renaming and dynamic scheduling facilities of a multi-issue architecture
• Hardware support:
  • Register files and PCs per thread
  • Temporary result registers pre-commit
  • Support to sort out which threads get results from which instructions
• Maximal utilization of execution units

(Diagram annotations: Skip A, Skip C)


Page 126: Why Vector and Multithreading Background?

GPU:
• A very wide vector machine
• Massively multi-threaded to hide memory latency
• Originally designed for graphics pipelines…


Page 133: Graphics ~= Rendering

Inputs
• 3D world model (objects, materials)
  • Geometry modeled with triangle meshes and surface normals
  • GPUs subdivide triangles into “fragments” (rasterization)
  • Materials modeled with “textures”
  • Texture coordinates and sampling “map” textures onto geometry
• Light locations and properties
  • Attempt to model surface/light interactions with the modeled objects/materials
• View point

Output
• 2D projection seen from the view-point



Page 144: Grossly over-simplified rendering algorithm

foreach(vertex v in model)
  map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
  frags.add(rasterize(t));
foreach fragment f in frags
  choose_color(f);
display(visible_fragments(frags));


Page 146: Algorithm → Graphics Pipeline

foreach(vertex v in model)
  map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
  frags.add(rasterize(t));
foreach fragment f in frags
  choose_color(f);
display(visible_fragments(frags));

OpenGL pipeline

To first order, DirectX looks the same!


Page 151: Graphics pipeline → GPU architecture

GeForce 6 series

Limited “programmability” of shaders:
• Minimal/no control flow
• Maximum instruction count


Page 156: Late Modernity: unified shaders

• Mapping to the graphics pipeline is no longer apparent
• Processing elements are no longer specialized to a particular role
• The model supports real control flow and a larger instruction count

Page 157: Mostly Modern: Pascal

Page 158: Definitely Modern: Turing

Page 159: Modern Enough: Pascal SM


Page 162: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web

Cross-generational observations

GPUs designed for parallelism in graphics pipeline:

• Data• Per-vertex• Per-fragment• Per-pixel

• Task• Vertex processing• Fragment processing• Rasterization• Hidden-surface elimination

• MLP• HW multi-threading for hiding memory latency

Dandelion 43

Even as GPU architectures become more general, certain assumptions persist:1. Data parallelism is trivially exposed2. All problems look like painting a box

with colored dots

But what if my problem isn’t painting a box?!!?!

10/30/2018

Page 163:

The big ideas still present in GPUs

• Simple cores
• Single instruction stream
  • Vector instructions (SIMD), OR
  • Implicit HW-managed sharing (SIMT)
• Hide memory latency with HW multi-threading

Page 164:

Programming Model

• GPUs are I/O devices, managed by user code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks
  • Threads in a block can synchronize (barrier)
  • This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel: a group of blocks (or warps)

Page 166:

Parallel Algorithms

• Sequential algorithms often do not permit easy parallelization
  • Does not mean the work has no parallelism
  • A different approach can yield parallelism, but often changes the algorithm
  • Parallelizing != just adding locks to a sequential algorithm

• Parallel Patterns
  • Map
  • Scatter, Gather
  • Reduction
  • Scan
  • Search, Sort

If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel

Page 168:

Map

• Inputs
  • Array A
  • Function f(x)
• map(A, f): apply f(x) to all elements of A
• Parallelism trivially exposed
  • f(x) can be applied in parallel to all elements, in principle

for (i = 0; i < numPoints; i++) {
    labels[i] = findNearestCenter(points[i]);
}

map(points, findNearestCenter)

Page 170:

Scatter and Gather

• Gather: read multiple items to a single location
• Scatter: write a single data item to multiple locations

for (i = 0; i < N; ++i)
    x[i] = y[idx[i]];        // gather(x, y, idx)

for (i = 0; i < N; ++i)
    y[idx[i]] = x[i];        // scatter(x, y, idx)

Page 174:

Reduce

• Input
  • Associative operator op
  • Ordered set s = [a, b, c, … z]
• Reduce(op, s) returns a op b op c … op z

for (i = 0; i < N; ++i) {
    accum += point[i] * point[i];
}

accum = reduce(+, map(square, point))

Why must op be associative?

Page 175:

Scan (prefix sum)

• Input
  • Associative operator op
  • Ordered set s = [a, b, c, … z]
  • Identity I
• scan(op, s) = [I, a, (a op b), (a op b op c), …]  (an exclusive scan)
• Scan is the workhorse of parallel algorithms:
  • Sort, histograms, sparse matrices, string compare, …

Page 176:

Summary

• Re-expressing apparently sequential algorithms as combinations of parallel patterns is a common technique when targeting GPUs