“Going Parallel with C++11” SUPERCOMPUTING 2012
Joe Hummel, PhD, UC-Irvine
[email protected]
http://www.joehummel.net/downloads.html
Uploaded Feb 11, 2016
Transcript
Page 1: “Going Parallel with C++11” Supercomputing 2012

“Going Parallel with C++11” SUPERCOMPUTING 2012

Joe Hummel, PhD, UC-Irvine

[email protected]

http://www.joehummel.net/downloads.html

Page 2: “Going Parallel with C++11” Supercomputing 2012

C++ 11

New standard of C++ has been ratified
 ◦ “C++0x” ==> “C++11”
Lots of new features; this workshop will focus on the concurrency features

Page 3: “Going Parallel with C++11” Supercomputing 2012

Why are we here?

Async programming, for better responsiveness…
 ◦ GUIs (desktop, web, mobile)
 ◦ Cloud
 ◦ Windows 8

Parallel programming, for better performance…
 ◦ Financials
 ◦ Scientific
 ◦ Big data

[Diagram: a multicore chip, each “C” a core]

Page 4: “Going Parallel with C++11” Supercomputing 2012

Hello World --- of threading

#include <thread>
#include <iostream>

// a simple function for the thread to do…
void func()
{
  std::cout << "**Inside thread "
            << std::this_thread::get_id() << "!" << std::endl;
}

int main()
{
  std::thread t;
  t = std::thread( func );   // create and schedule thread…

  t.join();                  // wait for thread to finish…
  return 0;
}


Page 5: “Going Parallel with C++11” Supercomputing 2012

Demo #1: Hello world…

Page 6: “Going Parallel with C++11” Supercomputing 2012

Avoiding early program termination…

#include <thread>
#include <iostream>

void func()
{
  std::cout << "**Hello world...\n";
}

int main()
{
  std::thread t;
  t = std::thread( func );

  t.join();   // (2) must join, otherwise termination…
  return 0;   //     (avoid use of detach(), difficult to use safely)
}

(1) The thread function must do its own exception handling; an unhandled
exception in a thread ==> termination:

void func()
{
  try {
    // computation:
  }
  catch(...) {
    // do something:
  }
}

Page 7: “Going Parallel with C++11” Supercomputing 2012

Pick your style…

Old school:
 ◦ distinct thread functions (what we just saw)
New school:
 ◦ lambda expressions (aka anonymous functions)

Page 8: “Going Parallel with C++11” Supercomputing 2012

Demo #2: a thread that loops until we tell it to stop…
When the user presses ENTER, we’ll tell the thread to stop…

Page 9: “Going Parallel with C++11” Supercomputing 2012

(1) via thread function

#include <thread>
#include <iostream>
#include <string>
#include <chrono>

using namespace std;

void loopUntil(bool *stop)
{
  auto duration = chrono::seconds(2);

  while (!(*stop))
  {
    cout << "**Inside thread...\n";
    this_thread::sleep_for(duration);
  }
}

int main()
{
  bool stop(false);
  thread t(loopUntil, &stop);

  getchar();     // wait for user to press enter:

  stop = true;   // stop thread:
  t.join();
  return 0;
}

Page 10: “Going Parallel with C++11” Supercomputing 2012

(2) via lambda expression

int main()
{
  bool stop(false);

  thread t = thread( [&] ()   // [&]: closure captures by ref ([ ]: none, [=]: by val, …)
  {                           // ( ): lambda arguments (none here)
    auto duration = chrono::seconds(2);

    while (!stop)
    {
      cout << "**Inside thread...\n";
      this_thread::sleep_for(duration);
    }
  } );

  getchar();     // wait for user to press enter:

  stop = true;   // stop thread:
  t.join();
  return 0;
}

Page 11: “Going Parallel with C++11” Supercomputing 2012

Trade-offs

Lambdas:
 ◦ Easier and more readable -- code remains inline
 ◦ Potentially more dangerous ([&] captures everything by ref)

Functions:
 ◦ More efficient -- lambdas involve a class and function objects
 ◦ Potentially safer -- requires explicit variable scoping
 ◦ More cumbersome and less readable

Page 12: “Going Parallel with C++11” Supercomputing 2012

Demo #3: multiple threads looping…
When the user presses ENTER, all threads stop…

Page 13: “Going Parallel with C++11” Supercomputing 2012

Solution

#include <thread>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <chrono>

using namespace std;

int main()
{
  cout << "** Main Starting **\n\n";
  bool stop = false;

  vector<thread> workers;

  for (int i = 1; i <= 3; i++)
  {
    workers.push_back( thread([i, &stop]()
    {
      while (!stop)
      {
        cout << "**Inside thread " << i << "...\n";
        this_thread::sleep_for(chrono::seconds(i));
      }
    }) );
  }

  getchar();

  stop = true;   // stop threads:

  // wait for threads to complete:
  for ( thread& t : workers )
    t.join();

  cout << "** Main Done **\n\n";
  return 0;
}

Page 14: “Going Parallel with C++11” Supercomputing 2012

A Real Example: matrix multiply…

Page 15: “Going Parallel with C++11” Supercomputing 2012

Multi-threaded solution

// 1 thread per core:
numthreads = thread::hardware_concurrency();

int rows  = N / numthreads;
int extra = N % numthreads;
int start = 0;      // each thread does [start..end)
int end   = rows;

vector<thread> workers;

for (int t = 1; t <= numthreads; t++)
{
  if (t == numthreads)   // last thread does extra rows:
    end += extra;

  workers.push_back( thread([start, end, N, &C, &A, &B]()
  {
    for (int i = start; i < end; i++)
      for (int j = 0; j < N; j++)
      {
        C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
          C[i][j] += (A[i][k] * B[k][j]);
      }
  }));

  start = end;
  end   = start + rows;
}

for (thread& t : workers)
  t.join();

Page 16: “Going Parallel with C++11” Supercomputing 2012

High-Performance Computing

Parallelism alone is not enough…

    HPC == Parallelism + Memory Hierarchy − Contention

Expose parallelism
Maximize data locality: network, disk, RAM, cache, core
Minimize interaction: false sharing, locking, synchronization

Page 17: “Going Parallel with C++11” Supercomputing 2012

Cache-friendly matrix multiply

Page 18: “Going Parallel with C++11” Supercomputing 2012

Cache-friendly solution

Loop interchange is the first step…

workers.push_back( thread([start, end, N, &C, &A, &B]()
{
  for (int i = start; i < end; i++)
    for (int j = 0; j < N; j++)
      C[i][j] = 0.0;

  for (int i = start; i < end; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        C[i][j] += (A[i][k] * B[k][j]);
}));

Next step is to block multiply…

Page 19: “Going Parallel with C++11” Supercomputing 2012

C++11 Features and Status

Page 20: “Going Parallel with C++11” Supercomputing 2012

Compilers…

No compiler as yet fully implements C++11
Visual C++ 2012 has the best concurrency support
 ◦ Part of Visual Studio 2012
gcc 4.7 has the best overall support
 ◦ http://gcc.gnu.org/projects/cxx0x.html
clang 3.1 appears very good as well
 ◦ I did not test
 ◦ http://clang.llvm.org/cxx_status.html

Page 21: “Going Parallel with C++11” Supercomputing 2012

Compiling with gcc

# makefile

# threading library: one of these should work
# tlib=thread
tlib=pthread

# gcc 4.6:
ver=c++0x
# gcc 4.7:
# ver=c++11

build:
	g++ -std=$(ver) -Wall main.cpp -l$(tlib)

Page 22: “Going Parallel with C++11” Supercomputing 2012

Executive Summary

Concept          Header                 Summary
Threads          <thread>               Standard, low-level, type-safe; good basis for building HL systems (futures, tasks, …)
Futures          <future>               Via async function; hides threading, better harvesting of return value & exception handling
Locking          <mutex>                Standard, low-level locking primitives
Condition Vars   <condition_variable>   Low-level synchronization primitives
Atomics          <atomic>               Predictable, concurrent access without data race
Memory Model     (core language)        “Catch Fire” semantics; if a program contains a data race, the behavior of memory is undefined
Thread Local     (core language)        Thread-local variables [ problematic => avoid ]

Page 23: “Going Parallel with C++11” Supercomputing 2012

Locking

Use a mutex to protect against concurrent access…

#include <mutex>

mutex m;
int sum;

thread t1([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
});

thread t2([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
});

Page 24: “Going Parallel with C++11” Supercomputing 2012

RAII

“Resource Acquisition Is Initialization”
 ◦ Advocated by B. Stroustrup for resource management
 ◦ Uses constructor & destructor to properly manage resources
   (files, threads, locks, …) in the presence of exceptions, etc.

thread t([&]()
{
  m.lock();
  sum += compute();
  m.unlock();
});

should be written as…

thread t([&]()
{
  lock_guard<mutex> lg(m);   // locks m in constructor
  sum += compute();
});                          // unlocks m in lg's destructor

Page 25: “Going Parallel with C++11” Supercomputing 2012

Atomics

Use atomic to protect shared variables…
 ◦ Lighter-weight than locking, but much more limited in applicability

#include <atomic>

atomic<int> count;
count = 0;

thread t1([&]()
{
  count++;
});

thread t2([&]()
{
  count++;
});

thread t3([&]()
{
  count = count + 1;   // X not safe: an atomic read followed by a separate
});                    //   atomic write, so concurrent increments can be lost

Page 26: “Going Parallel with C++11” Supercomputing 2012

Lock-free programming

Atomics enable safe, lock-free programming
 ◦ “Safe” is a relative word…

“done” flag:

int x;
atomic<bool> done;
done = false;

thread t1([&]()
{
  x = 42;
  done = true;
});

thread t2([&]()
{
  while (!done)
    ;
  assert(x == 42);
});

Lazy init (both threads run the same code):

int x;
atomic<bool> initd;
initd = false;

thread t1([&]()
{
  if (!initd)
  {
    lock_guard<mutex> _(m);
    x = 42;
    initd = true;
  }
  // << consume x, … >>
});

thread t2([&]()
{
  if (!initd)
  {
    lock_guard<mutex> _(m);
    x = 42;
    initd = true;
  }
  // << consume x, … >>
});

Page 27: “Going Parallel with C++11” Supercomputing 2012

Demo: prime numbers…

Page 28: “Going Parallel with C++11” Supercomputing 2012

Futures

Futures provide a higher level of abstraction
 ◦ Start an asynchronous operation on some thread, await the result…

#include <future>
. . .

future<int> fut = async( []() -> int   // -> int: lambda return type
{
  int result = PerformLongRunningOperation();
  return result;
});
. . .

try
{
  int x = fut.get();   // join, harvest result:
  cout << x << endl;
}
catch(exception &e)
{
  cout << "**Exception: " << e.what() << endl;
}

Page 29: “Going Parallel with C++11” Supercomputing 2012

Execution of futures

May run on the current thread; may run on a new thread. Often better to let the system decide…

// run on current thread when someone asks for value (“lazy”):
future<T> fut1 = async( launch::sync,     []() -> ... );   // pre-standard (VC++) name
future<T> fut2 = async( launch::deferred, []() -> ... );   // standard C++11

// run on a new thread:
future<T> fut3 = async( launch::async, []() -> ... );

// let system decide:
future<T> fut4 = async( launch::any, []() -> ... );        // pre-standard (VC++) name
future<T> fut5 = async( []() ... );                        // standard default policy

Page 30: “Going Parallel with C++11” Supercomputing 2012

Netflix data-mining…

Demo: a mining app reads Netflix data (NetflixMovieReviews(.txt)) and computes the average review for a movie…

Page 31: “Going Parallel with C++11” Supercomputing 2012

Memory Model

The C++ committee thought long and hard on memory model semantics…
 ◦ “You Don’t Know Jack About Shared Variables or Memory Models”, Boehm and Adve, CACM, Feb 2012

Conclusion:
 ◦ No suitable definition in the presence of race conditions

Solution:
 ◦ Predictable memory model *only* in data-race-free codes
 ◦ Computer may “catch fire” in the presence of data races

Page 32: “Going Parallel with C++11” Supercomputing 2012

Example…

int x, y, r1, r2;
x = y = r1 = r2 = 0;

thread t1([&]() { x = 1; r1 = y; });
thread t2([&]() { y = 1; r2 = x; });

t1.join();
t2.join();

What can we say about r1 and r2?

Page 33: “Going Parallel with C++11” Supercomputing 2012

Dekker’s example…

If we think in terms of all possible thread interleavings (aka “sequential
consistency”), then we know r1 = 1, r2 = 1, or both.

In C++11? Not only are the values of x, y, r1 and r2 undefined, but the
program may crash!

Page 34: “Going Parallel with C++11” Supercomputing 2012

C++11 Memory Model

Def: two memory accesses conflict if they
 1. access the same scalar object or contiguous sequence of bit fields, and
 2. at least one access is a store.

Def: two memory accesses participate in a data race if they
 1. conflict, and
 2. can occur simultaneously (via independent threads, unordered by locks, atomics, …).

A program is data-race-free (DRF) if no sequentially-consistent execution
results in a data race. Avoid anything else.

Page 35: “Going Parallel with C++11” Supercomputing 2012

Beyond Threads

Page 36: “Going Parallel with C++11” Supercomputing 2012

Tasks vs. Threads

Task: a unit of work; an object denoting an ongoing operation or computation.

Tasks are a higher-level abstraction
 ◦ Idea: developers identify work; the run-time system deals with execution details

Page 37: “Going Parallel with C++11” Supercomputing 2012

Example: Matrix Multiply with Microsoft PPL (Parallel Patterns Library)

#include <ppl.h>

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    C[i][j] = 0.0;

// for (int i = 0; i < N; i++)
Concurrency::parallel_for(0, N, [&](int i)
{
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);
});

Page 38: “Going Parallel with C++11” Supercomputing 2012

Execution Model

[Diagram: a Windows process running on a multicore chip. parallel_for(...)
submits tasks to a global work queue; the Parallel Patterns Library’s Task
Scheduler and Resource Manager dispatch them onto a pool of worker threads
running on top of Windows.]

Page 39: “Going Parallel with C++11” Supercomputing 2012

That’s it!

Page 40: “Going Parallel with C++11” Supercomputing 2012

Presenter: Joe Hummel
 ◦ Email: [email protected]
 ◦ Materials: http://www.joehummel.net/downloads.html

References:
 ◦ Book: “C++ Concurrency in Action”, by Anthony Williams
 ◦ Talks: Bjarne and friends at MSFT’s “Going Native 2012”
   http://channel9.msdn.com/Events/GoingNative/GoingNative-2012
 ◦ Tutorials: really nice series by Bartosz Milewski
   http://bartoszmilewski.com/2011/08/29/c11-concurrency-tutorial/
 ◦ FAQ: Bjarne Stroustrup’s extensive FAQ
   http://www.stroustrup.com/C++11FAQ.html

Thank you for attending