How to Develop Solaris Parallel Applications

Vijay Tatkar, Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
The GHz Chip Clock Race is Over...

Classic CPU efficiencies: clock speed, execution optimization, cache
Design impediments: heat, power, memory slower than the chips
Where is my 10GHz chip?
Putting transistors to work in a new way...

The Multicore Revolution

UltraSPARC T2: 1.4GHz x 8 cores (64 threads in a chip)
Intel Penryn: 4 cores x 3.1GHz; AMD Barcelona: 4 cores x 2.3GHz (4 threads in a chip)
Every new system now has a multi-core chip in it
Things to know about Parallelism

Parallel processing is not just for massively parallel supercomputers anymore (HPC = High Priced Computing).
CPU clock speed doubled every 18 months, whereas memory speed doubled every 6 years! Heat, memory and power lead to multi-core CPUs.
The free ride is over for serial programs that rely on the hardware to boost performance.
Parallel programming is the BEST BET for speedups:
> Parallelism is all about performance, first and foremost
> Program correctness is often harder for parallel programs
Parallelism is often considered hard, but there are several models to choose from, and compiler support for each model to ease the choice.
Programming Model

Shared Memory Model
> OpenMP (de-facto standard)
> Java, native multi-threaded programming
Distributed Memory Model
> Message Passing Interface, MPI (de-facto standard)
> Parallel Virtual Machine, PVM (less popular)
Global Address Space
> Unified Parallel C, UPC (research technology)
Grid Computing
> Sun Grid Computing (www.network.com)
> Sun Grid Engine (www.sun.com/software/gridware)
Automatic Parallelization and Vectorization

[Diagram: a spectrum of parallelization technologies, from Easiest to Hardest. An application builds on Solaris primitives (event ports, POSIX threads, Solaris threads, atomic operations, libumem) and runs on UltraSPARC T1/T2, SPARC64 VI / UltraSPARC IV+, and Intel/AMD x86/x64 hardware. The Sun Studio Developer Tools cover the spectrum: instruction-level parallelism, automatic parallelization, automatic vectorization, tuned MT libraries, OpenMP, MT and MPI.]
Instruction-level Parallelism

Chips have figured out how to dispatch multiple instructions in parallel; compilers have figured out how to schedule for such processors.
Chips + compilers are very mature in this regard, so no programmer action is required and the gain is automatic, wherever possible.
It IS possible to chew gum and walk at the same time!
Automatic Parallelization

Supported for Fortran, C and C++ applications; first introduced for the 4-20 way SPARCserver 600 MP in 1991
Useful for loop-oriented programs
> Every (nested) loop is analyzed for data dependencies and parallelized if it is safe to do so
> Non-loop code fragments are not analyzed
> Loops are versioned with serial and parallel code (selected at runtime)
Combines with powerful loop optimizations
> There can be subtle interactions between loop transformations and parallelization
> Compilers have limited knowledge about the application
Overall gains can be impressive
> The entire SPECfp 2006 suite gains 16% with PARALLEL=2
> Individual gains can be up to 2x for suitable programs; libquantum from SPEC CPU2006 speeds up 6-7x on 8 cores!
> Not every program will see a gain
Automatic Parallelization Options

-xautopar
> Automatic parallelization (Fortran, C and C++ compilers); requires -xO3 or higher (-xautopar implies -xdepend)
-xreduction
> Parallelize reduction operations; recommended to use -fsimple=2 as well
-xloopinfo
> Show parallelization messages on screen
Only apply these options to the most time-consuming parts of the program (a sketch follows below).
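As a minimal sketch of where -xreduction matters, consider summing an array: every iteration updates the same accumulator, so -xautopar alone must leave the loop serial, while -xreduction lets the compiler parallelize it as partial sums. The file name sum.c is made up for illustration.

    /* sum.c -- a reduction loop: each iteration updates the shared
     * accumulator "sum", so it needs -xreduction to be parallelized. */
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i];        /* reduction on sum */
        return sum;
    }

A plausible compile line, combining the options above:

    % cc -xO3 -xautopar -xreduction -fsimple=2 -xloopinfo -c sum.c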
AutoPar: SPECfp 2006 improvements

[Bar chart: per-benchmark gain of "Base Flags + Autopar" over "Base Flags" for bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, gemsFDTD, tonto, lbm, wrf and sphinx3, on a percentage scale from 0 to 27.5.]

Woodcrest box: 3.0GHz dual-core, PARALLEL=2
Overall gain: 16%
Automatic Vectorization

Supported for Fortran, C and C++ applications
-xvector=simd exploits special SSE2+ instructions
Works on data in adjacent memory locations
Gains are smaller than with -xautopar
> SPECfp 2006 gains are 3% overall and up to 14% individually
Best suited for loop-level SIMD parallelism; a typical candidate is an elementwise loop of the form:

    for (i = 0; i < n; i++)
        x[i] = x[i] + y[i];
Case Study:
Vectorizing STREAM
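The slides leave the demo itself to the live session; as a hedged sketch of what gets vectorized, here is the triad kernel from the public STREAM benchmark (the array size and the compile line are assumptions, not taken from the deck):

    /* STREAM "triad" kernel: three streams of adjacent memory
     * locations, an ideal candidate for -xvector=simd. */
    #define N 2000000
    static double a[N], b[N], c[N];

    void triad(double scalar)
    {
        for (int j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];   /* maps onto packed SSE2 multiply-adds */
    }

    % cc -xO3 -xvector=simd -xloopinfo -c stream_triad.c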
Tuned MT Libraries: Sun Perf Lib
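The slide is just a pointer, so here is a hedged sketch of what a tuned MT library buys you: the Sun Performance Library ships multithreaded BLAS/LAPACK routines, so a matrix multiply can be handed to dgemm instead of hand-written loops. The declaration below uses the standard Fortran BLAS binding; the -xlic_lib=sunperf link flag is quoted from memory and may differ by release.

    /* Sketch: C = A*B for n x n column-major matrices, delegated to
     * the library's multithreaded dgemm. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    void mat_mul_blas(int n, const double *a, const double *b, double *c)
    {
        const double alpha = 1.0, beta = 0.0;
        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    }

    % cc -fast mat_mul_blas.c -xlic_lib=sunperf

The library picks its thread count from the environment (e.g. PARALLEL), so the application code itself stays serial.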
Compiler Support: OpenMP

[Diagram: the same Easiest-to-Hardest technology spectrum (AutoPar, OpenMP, MT, MPI over the Solaris threading primitives and SPARC/x86/x64 hardware), with OpenMP highlighted.]
What is OpenMP?

De-facto industry standard API for writing shared-memory parallel applications in C, C++ and Fortran. See: http://www.openmp.org
Consists of:
> Compiler directives (pragmas)
> Runtime routines (libmtsk)
> Environment variables
Advantages:
> Incremental parallelization of source code
> Small(er) amount of programming effort
> Good performance and scalability
> Portable across a variety of vendor compilers
Sun Studio has consistently led OpenMP:
> Support for the latest version (2.5 now, the v3.0 API underway)
> Consistent world-record SPEC OMP submissions for several years now
OpenMP - Directives with Intelligence
A Loop Parallelized With OpenMP

C/C++:

    #pragma omp parallel default(none) \
            shared(n, x, y) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            x[i] += y[i];
    }   /* -- End of parallel region -- */

Fortran:

    !$omp parallel default(none) &
    !$omp shared(n, x, y) private(i)
    !$omp do
          do i = 1, n
             x(i) = x(i) + y(i)
          end do
    !$omp end do
    !$omp end parallel

The default, shared and private clauses control how variables are scoped across the team of threads.
An OpenMP Example

Find the primes up to 3,000,000 (216,816 of them). Run on a Sun Fire 6800, Solaris 9, 24 processors, 1.2GHz UltraSPARC III+, with 9.8GB main memory (a build-and-run sketch follows the table).

Model    # threads   Time (secs)   % change
Serial   N/A         6.636         Base
OpenMP   1           7.210         8.65% drop
OpenMP   2           3.771         1.76x faster
OpenMP   4           1.988         3.34x faster
OpenMP   8           1.090         6.09x faster
OpenMP   16          0.638         10.40x faster
OpenMP   20          0.550         12.06x faster
OpenMP   24          0.931         Saturation drop
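A plausible way to reproduce the shape of this experiment, assuming the Sun Studio -xopenmp flag and the standard OMP_NUM_THREADS variable (the source file name primes_omp.c is made up):

    % cc -fast -xopenmp -xloopinfo primes_omp.c -o primes_omp
    % setenv OMP_NUM_THREADS 4
    % ptime ./primes_omp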
Compiler Support: Programming Threads

[Diagram: the same Easiest-to-Hardest technology spectrum, with MT (native threads programming) highlighted.]
Programming Threads

Use the POSIX APIs pthread_create, pthread_join, pthread_exit, et al.
> Recommendation: consider reducing the thread stack size (the default is 1MB); see the sketch below
> See pthread_attr_init(3C) for this and other attributes which can be adjusted
Do not use the native Solaris threading API (e.g., thr_create).
> Though applications which use it are still supported, it is non-portable.
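A minimal sketch of that recommendation, assuming a worker routine like the one in the demo below; the 64KB figure is an arbitrary illustration, with PTHREAD_STACK_MIN as the portable floor:

    #include <pthread.h>
    #include <limits.h>

    extern void *work(void *arg);          /* worker routine */

    int start_worker(pthread_t *tid, void *arg)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* shrink the per-thread stack from the 1MB default */
        pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN + 64 * 1024);
        int rc = pthread_create(tid, &attr, work, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }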
Data Synchronization

Concurrent access to shared data requires synchronization:
> Mutexes (pthread_mutex_lock/pthread_mutex_unlock); a sketch follows below
> Condition variables (pthread_cond_wait)
> Reader/writer locks (pthread_rwlock_rdlock/pthread_rwlock_wrlock)
> Spin locks (pthread_spin_lock)
Objects can be local to a process or shared between processes via shared memory.
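As a minimal sketch of the first bullet, a mutex serializing updates to shared state; this hypothetical record_prime helper is exactly the synchronization the primes demo below is missing:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int total = 0;

    void record_prime(int primes[], int value)
    {
        pthread_mutex_lock(&lock);
        primes[total] = value;   /* index and increment now happen atomically */
        total++;
        pthread_mutex_unlock(&lock);
    }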
MT Demo
Multithreading Primes
int is_prime(int v)
{
    int i;
    int bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        if (v % i == 0)     /* found a divisor: not prime */
            return 0;
    }
    return (v > 1);
}
void *work(void *arg)
{
    int start;
    int end;
    int i;
    int val = *((int *) arg);   /* DATA RACE: main reuses i's address for every thread */

    start = (N / THREADS) * val;
    end = start + N / THREADS;
    for (i = start; i < end; i++) {
        if (is_prime(i)) {
            primes[total] = i;  /* DATA RACE: unsynchronized shared index */
            total++;            /* DATA RACE: unsynchronized shared counter */
        }
    }
    return NULL;
}
int main(int argc, char **argv)
{
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    for (i = 0; i < (THREADS - 1); i++) {
        /* BUG: every thread receives the address of the same loop
         * variable i, which keeps changing underneath it. */
        pthread_create(&tids[i], NULL, work, (void *) &i);
    }
    i = THREADS - 1;
    work((void *) &i);          /* main thread does the last slice itself */
    for (i = 0; i < THREADS - 1; i++) {  /* join only the created threads */
        pthread_join(tids[i], NULL);
    }
}
STOP! Problem Ahead
RDT Demo, please
Data Race Condition

A data race condition occurs when:
> multiple threads access a shared memory location,
> without a synchronized access order,
> and at least one of the accesses is a write.
A data race problem often occurs in shared-memory parallel programming models such as Pthreads and OpenMP.
> The effect of a data race is unpredictable and may show up only once in hundreds of runs.
Thread Analyzer

Detects data races and deadlocks in a multithreaded application:
> Points to non-deterministic or incorrect execution
> Such bugs are notoriously difficult to detect by examination
> Points out actual and potential deadlock situations
Process (a command sketch follows below):
> Instrument the code with -xinstrument=datarace
> Record the runtime behavior with collect -r race (or -r deadlock, -r all)
> Use the graphical analyzer, tha, to identify conflicts and critical regions
Works with OpenMP, Pthreads and Solaris threads
> An API is provided for user-defined synchronization primitives
Works on Solaris (SPARC, x86/x64) and Linux
A static lock_lint tool detects inconsistent use of locks
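A plausible end-to-end sequence for the primes demo (the file name prime_pthr.c and the experiment name are illustrative):

    % cc -g -xinstrument=datarace prime_pthr.c -o prime_pthr -lpthread
    % collect -r race ./prime_pthr     # records a race-detection experiment, e.g. tha.1.er
    % tha tha.1.er                     # browse the detected races at source level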
A True SPEC Story

SPEC OMP benchmark fma3d:
> 101 source files; 61,000 lines of Fortran code
> A data race in platq.f90 caused sporadic core dumps
> It took several engineers and 6 weeks of work to find the data race manually

Perils of Having a Data Race Condition

The program exhibits non-deterministic behavior
The failure may be hard to reproduce
The program may continue to execute, leading to failure in unrelated code
A data race is hard to detect using conventional debugging methods and tools
How did Thread Analyzer help?

The same fma3d data race that took several engineers six weeks to find manually was detected with the Sun Studio Thread Analyzer in just a few hours!
Compiler Support: Message Passing Interface

[Diagram: the same Easiest-to-Hardest technology spectrum, with MPI highlighted.]
Message Passing Interface (MPI)

The MPI programming model is a de-facto standard for distributed-memory parallel programming.
The MPI API set is quite large (323 subroutines), yet an MPI application can be programmed with fewer than 10 different calls (a sketch follows below).
MPI is implemented over a very small set of low-level device-interconnect routines.
Open MPI: http://www.open-mpi.org/
MPI home page at Argonne National Laboratory: http://www-unix.mcs.anl.gov/mpi/
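A minimal sketch of such a program, exercising six of the most common calls (the file name hello_mpi.c is made up; any MPI implementation should build it with mpicc):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */

        if (rank == 0 && size > 1)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        printf("rank %d of %d done\n", rank, size);

        MPI_Finalize();
        return 0;
    }

    % mpicc hello_mpi.c -o hello_mpi
    % mpirun -np 4 hello_mpi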
Message Passing Interface (MPI)

Open MPI: MPI 2.0 conformance
ClusterTools 7.0 ships with Sun Studio support
Multiple processes run under the Open Runtime Environment
Data messages pass between processes in point-to-point/block communication mode
No race conditions arise with the right use of MPI message-passing calls
MPI profiling is supported under the Performance Analyzer
Launching MPI Applications

For Single Program Multiple Data (SPMD):
> mpirun -np x program1
For Multiple Program Multiple Data (MPMD):
> mpirun -np x program1 : -np y program2
Launching on different nodes (hosts):
> mpirun -np x -host <hostlist> program1
And more... a very flexible way of launching.
37/55
Sun Tech Days 07- 08 /Sun Studio - # 37
Comparing OpenMP and MPI

OpenMP                              MPI
De-facto industry standard          De-facto industry standard
Limited to one (SMP) system         Runs on any number of systems
Not (yet?) GRID-ready               GRID-ready
Easier to get started               High and steep learning curve
Assistance from compilers           You're on your own
Mix-and-match model                 All-or-nothing model
Requires data scoping               No data scoping required
Increasingly popular (CMT?)         More widely used (but...)
Preserves sequential code           No sequential version
Needs a compiler                    No compiler; just a library
No special environment              Requires a runtime environment
Performance issues implicit         Easy to control performance
Thank you!

Vijay Tatkar, Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
Case Study:
AutoPar Matrix Multiply
AutoPar Example Program

// Matrix multiplication
32  #define MAX 1024
33  void matrix_mul(float (*x_mat)[MAX],
34      float (*y_mat)[MAX], float (*z_mat)[MAX]) {
35
36      for (int j = 0; j < MAX; j++) {
37          for (int k = 0; k < MAX; k++) {
38              z_mat[j][k] = 0.0;
39              for (int t = 0; t < MAX; t++) {
40                  z_mat[j][k] += x_mat[j][t] * y_mat[t][k];
41              }
42          }
43      }
44  }
AutoPar Example Compilation

% CC -c mat_mul.cc -g -fast -xrestrict -xautopar -xloopinfo -o mat_mul.o
"mat_mul.cc", line 36: PARALLELIZED
"mat_mul.cc", line 37: not parallelized, not profitable
"mat_mul.cc", line 39: not parallelized, unsafe dependence

You can run the er_src command on the compiled binary to see the compiler's internal messages (see below).
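A plausible invocation on the object file built above; er_src reads the compiler commentary embedded by -g and annotates each source line with its parallelization and loop-transformation messages:

    % er_src mat_mul.o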
% CC mat_mul.cc -g -fast -xrestrict -xinline=no -o noautopar
% CC mat_mul.cc -g -fast -xrestrict -xloopinfo -xautopar -xinline=no -o autopar

% ptime noautopar
Finish multiplication of matrix of 1024
real  1.536
user  1.521
sys   0.018

% ptime autopar
Finish multiplication of matrix of 1024
real  1.542
user  1.520
sys   0.016

% setenv PARALLEL 2
% ptime autopar
Finish multiplication of matrix of 1024
real  0.817
user  1.572
sys   0.016

With PARALLEL=2 the wall-clock time nearly halves while the total CPU time stays flat: the same work, spread across two cores.
OpenMP Demo
Parallelizing Primes
Parallelizing Primes Example (OpenMP)

Partition the problem space into smaller chunks and dispatch the processing of each partition into individual (micro)tasks.
> A popular and practical example to illustrate how parallel software deals with large data
> The basic design concept of this example can be applied to many other parallel processing tasks
> The overall program structure is very simple:
  > a thread worker routine
  > a main program creating multiple worker threads/microtasks
int main_omp(int argc, char **argv)
{
#ifdef _OPENMP
    omp_set_num_threads(NTHRS);
    omp_set_dynamic(0);
#endif
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            primes[total] = i;   /* DATA RACE: unsynchronized shared index */
            total++;             /* DATA RACE: unsynchronized shared counter */
        }
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
}
int is_prime(int v)
{
    int i, bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        if (v % i == 0)
            return 0;
    }
    return (v > 1);
}
General Race Condition

A general race condition is caused by an undetermined sequence of executions that violates the integrity of the program state.
> A data race condition is a simple form of general race condition
> A general race problem can occur in both shared-memory and distributed-memory parallel programming
Sun Tech Days 07- 08 /Sun Studio - # 49
Design Practice to Avoid Races
Adopt a higher design abstraction such as OpenMP Use Pass-by-value instead of pass-by-pointer to
communicate between the threads
Design the data structure to limit the global variableusage and restrict the access of shared memory
Analyze a race problem to decide if it is a harmfulprogram bug or a benign race
Understand and fix the real cause of a race conditioninstead of fixing race condition symptom
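As a sketch of the first practice, the racy primes loop from the OpenMP demo can be repaired at the OpenMP level itself; a critical section is the simplest fix (a reduction or per-thread buffers would scale better):

    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            #pragma omp critical    /* serialize the shared update: race gone */
            {
                primes[total] = i;
                total++;
            }
        }
    }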
MPI: Single Program Multiple Data

The processes launched are all in the same communicator:
> mpirun -np 8 msorts
> The 8 processes launched belong to the MPI_COMM_WORLD communicator
> 8 ranks: 0, 1, 2, 3, 4, 5, 6, 7
> Total size: 8
All 8 processes run the same program; control flow differs by checking the rank:

    MPI_Init(...);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        ...
    } else if (rank == 1) {
        ...
    } else if (rank == 2) {
        ...
    }
    MPI_Finalize();
MPI Example: 7 Sorting Processes

[Diagram: a Driver process connected to seven sorting processes: Shakersort, Heapsort, Straight Insertion Sort, Bubblesort, Straight Selection Sort, Quicksort and Binary Insertion Sort.]

All together, 8 processes (a sketch follows below).
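A hedged sketch of how the driver might farm the same array out to each sorter and collect the results; the message tags, array size and the sort_for_rank dispatch helper are made up for illustration:

    #define N 100000

    if (rank == 0) {                      /* driver */
        int data[N], sorted[N];
        /* ... fill data ... */
        for (int r = 1; r < size; r++)
            MPI_Send(data, N, MPI_INT, r, 0, MPI_COMM_WORLD);
        for (int r = 1; r < size; r++)    /* e.g. to compare the algorithms' timings */
            MPI_Recv(sorted, N, MPI_INT, r, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {                              /* one of the 7 sorters */
        int buf[N];
        MPI_Recv(buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sort_for_rank(rank, buf, N);      /* hypothetical dispatch to this rank's sort */
        MPI_Send(buf, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }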
MPI Demo
7 Sorting Processes
MPI: Non-Uniform Memory Performance

[Plot: performance versus working-set size across the memory hierarchy: registers, L1 cache (64KB), L2 cache (8MB), main memory and virtual memory. The curve is a staircase of plateaus and drops; the cache boundaries mark the tuning area.]

The length of a plateau is related to the size of that memory component.
The size of the drop is related to the latency (or bandwidth) of that memory component.
MPI can help reduce per-process program size so it fits into the good regions.
Sun Studio and HPC

Sun HPC: http://www.sun.com/servers/HPC/index.jsp
Sun HPC ClusterTools 7 software: http://www.sun.com/software/products/clustertools
N1 Grid Engine manager software

Other MPI libraries:
> Open-source MPICH library for Solaris SPARC: http://www-unix.mcs.anl.gov/mpi/mpich
> LAM/MPI ported library for Solaris x86/x64: http://apstc.sun.com.sg/popup.php?l1=research&l2=projects&l3=s10port&f=applications#LAM/MPI
> MVAPICH (MPI over InfiniBand) for Solaris x86/x64: http://nowlab.cse.ohio-state.edu/projects/mpi-iba
Parallel Computing Environment

[Diagram: the parallel computing stack, from tightly coupled to loosely coupled. MT applications map onto multi-threading, OpenMP applications onto OpenMP, MPI applications onto multi-process MPI (with UPC/GAS alongside), and serial applications onto web services/SOA. Locally, applications run on a cluster grid (N1 Grid); at the global and enterprise level they federate through Grid & SOA.]