Computing Workshop for Users of NCAR’s SCD machines
Christiane Jablonowski ([email protected])
NCAR ASP/SCD, 31 January 2006
ML Mesa Lab, Chapman Room; video conference facilities: FL EOL Atrium and CG1 3150
Overview
– Current machine architectures at NCAR (SCD)
– Some basics on parallel computing
– Batch queuing systems at NCAR
– GAU resources & how to obtain a GAU account
– Insights into GAU charges
– The Mass Storage System
– How to monitor the GAUs
– Some practical tips on benchmarks, debugging tools, restarts…
Computer architectures
SCD’s machines are UNIX-based parallel computing architectures
Two types:
– Hybrid (shared and distributed memory) machines: processors on a node have direct access to memory (shared memory); nodes are connected via the network (distributed memory)
MPI example
Processors communicate via messages
MPI Example
Initialize & finalize MPI in your program via function/subroutine calls to the MPI library. Examples include: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
Example from previous page in C notation (unoptimized):
Important to note: such an operation (computing a global sum) is very common, therefore MPI provides a highly optimized function, also called a ‘reduction operation’ MPI_Reduce (…) that can replace the example above
Example: domain decompositions for MPI
Each color represents a processor
OpenMP Example
Parallel loops via compiler directives (here: in Fortran notation). Before the program is called, set:
setenv OMP_NUM_THREADS #proc
Add compiler directives in your code:
!$OMP PARALLEL DO
DO i = 1, n
  a(i) = b(i) + c(i)
END DO
!$OMP END PARALLEL DO
(Figure: the master thread forks into a team of threads for the parallel loop, then joins back into the master thread)
Assume n=1000 & #proc=4: The loop will be split into 4 ‘threads’ that run in parallel with loop indices 1…250, 251…500, 501…750, 751…1000
SCD’s machines
Bluesky (web page)
– ‘Oldest’ machine at NCAR (2002)
– Lots of user experience at NCAR, easy access to help
– CAM/CCSM/WRF are set up for this architecture (Makefiles)
– Batch queuing system LoadLeveler, short interactive runs possible
– Batch queues are listed under
Bluevista (web page)
– Newest machine on the floor (Jan. 2006)
– CAM/CCSM/WRF are (probably) set up for this architecture
– Batch queuing system LSF (Load Sharing Facility)
– Queue names different from bluesky: premium, regular, economy,
ASP GAU account number: 54042108 (also project number)
– Needs to be specified in the batch job scripts
– The ASP account number is not your default account number
Therefore: everybody needs a second (default) GAU account:
– divisional GAU account
– so-called University account (small request form for 1500 GAUs: http://www.cisl.ucar.edu/resources/compServ.html); these GAUs do not expire every month, one-time allocation
The second GAU account should be used for the accumulating MSS charges
– automatic when using CAM / CCSM’s MSS option
GAU charges on SCD’s supercomputers
You are charged GAUs for how much time you use a processor (on bluesky, bluevista, lightning, tempest)
On bluesky, there are actually two formulas:
– Shared-node usage:
GAUs charged = CPU hours used × computer factor × class charging factor
– Dedicated-node usage:
GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor
(Slides on GAU charges: modified from an earlier presentation by George Bryan, NCAR MMM)
“Number of nodes used” and “Number of processors in that node”

“CPU hours used” and “Wallclock hours used”
A measure of how long you “used” a processor. NOTE: this includes all the time you were allocated the use of a processor, whether you actually used it or not.
Example: you used two 8-processor nodes on bluesky. The job started at 1:00 PM and finished at 2:30 PM.
You are charged for 1.5 hrs
“Computer factor”
A measure of how powerful a computer is
– Bluesky: 0.24
– Bluevista: 0.5
– Lightning: 0.34
This “levels the playing field”
“Class charging factor”
Tied to queuing system: “How quickly do you want your results, and how much are you willing to pay for it?”
Current setting on all SCD supercomputers:
– Premium = 1.5 (highest priority, fastest turnaround)
– Regular = 1.0
– Economy = 0.5
– Standby = 0.1 (lowest priority, slow turnaround)
Example
Recall dedicated-node usage on bluesky:
GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor
1.5 hours using two 8-processor nodes, bluesky regular queue:
GAUs used = 1.5 × 2 × 8 × 0.24 × 1.0 = 5.76 GAUs
In the premium queue, this would be 8.64 GAUs
In the standby queue, this would be 0.576 GAUs
Recommendations: Queuing systems
Check the queue before you submit any job:
– If the queue is not busy, try using the standby or economy queues
– The queue tends to be “emptier” evenings, weekends, and holidays
A job will start sooner when you specify a wallclock limit in the job script (the scheduler tries to ‘squeeze in’ short jobs)
The fewer processors you request, the sooner you start
Use the premium queue sparingly:
– Short debug jobs (there is also a special debug queue on lightning)
– When that conference paper is due
Recommendations: # of processors vs. run times
If you are using more processors, you might wait longer in the queue, but usually the actual runtime of your job is reduced
Caveat: it usually costs more GAUs. Example: you run the same job twice:
– Using 8 processors, the job ran in 24 hours– Using 64 processors, the job ran in 4 hours
– 1st example used 46 GAUs– 2nd example used 61 GAUs
The Mass Storage System
MSS: Mass storage system (disks and cartridges) for your big data sets
MSS connected to the SCD machines, sometimes also to divisional computers
MSS users have directories like mss:/LOGIN_NAME/
Quick online reference (mss commands): http://www.cisl.ucar.edu/docs/mss/mss-commandlist.html
You are charged GAUs for using the MSS
The GAU equation for the MSS is more complicated:
GAUs charged = 0.0837 R + 0.0012 A + N (0.1195 W + 0.2050 S)
where:
– R = gigabytes read
– W = gigabytes created or written
– A = number of disk drive or tape cartridge accesses
– S = data stored, in gigabyte-years
– N = number of copies of the file: 1 if economy reliability selected; 2 if standard reliability selected
MSS Charges
Recommendations: The MSS
MSS charges seem small, but they add up!
Examples: FY04 MSS usage
– ACD: 24,000 of 60,000 GAUs
– CGD: 94,500 of 181,000 GAUs
– HAO: 22,000 of 122,000 GAUs
– MMM: 34,000 of 139,000 GAUs
– RAP: 32,000 of 35,000 GAUs
Recommendations: The MSS
Recommendation for ASP users:
– use an account in your home division or your so-called ‘university’ account (1500 GAUs for postdocs, you need to apply) for MSS charges
– leave ASP GAUs for supercomputing
GAU Usage Strategy: 30-day and 90-day averages
The allocation actually works through 30-day and 90-day averages
Limits: 120% for 30-day use, 105% for 90-day use
It is helpful to spread usage out evenly
How to check GAU usage:
– Type “charges” on the command line of a supercomputer
– Check the “daily summary” output (next page)
– SCD Portal: look for the link on SCD’s main page
Web page: http://www.cisl.ucar.edu/dbsg/dbs/ASP/
Sample output:
ASP 30 Day Percent = 57.0 %   ASP 90 Day Percent = 48.3 %
30 Day Allocation  = 3850     90 Day Allocation  = 11550
30 Day Use         = 2193     90 Day Use         = 5575
90 DAY ST: 01-NOV-05   30 DAY ST: 31-DEC-05   LAST DAY: 29-JAN-06
SCD Portal
Online tool that helps you monitor the GAU charges and the current machine status (e.g. batch queues); the display can be customized
Information on the machine status requires a setup command on roy.scd.ucar.edu via the crypto-card access: just enter ‘scdportalkey hostname’ (e.g. lightning) after logging on with the crypto-card
At this time (Jan/31/2006) the GAU charges on bluevista are not itemized; this will be included in the next release in Spring 2006
Other IBM resources
Sources of information on the IBM machines (on bluesky, from the command line; batchview also works on bluevista & lightning):
– batchview for an overview of jobs with their rankings
– llq for a list of all submitted jobs, no ranking
– spinfo: queue limits, memory quotas on the home file system and the temporary file system /ptmp
– Useful IBM LoadLeveler keywords in the script:
#@ account_no = 54042108  -> ASP account
#@ ja_report  = yes       -> job report (see next page)
IBM Job Report
Operating System                 : blackforest AIX51
User Name (ID)                   : cjablono (7568)
Group Name (ID)                  : ncar (100)
Account Name                     : 54042108
Job Name                         : bf0913en.26921
Job Sequence Number              : bf0913en.26921
Job Starts                       : 12/20/04 17:56:33
Job Ends                         : 12/20/04 23:26:34
Elapsed Time (Wall-Clock * #CPU) : 633632 s
Number of Nodes (not_shared)     : 8
Number of CPUs                   : 32
Number of Steps                  : 1
IBM Job Report (continued)
Charge Components
Wall-clock Time                     : 5:30:01
Wall-clock CPU hours                : 176.00889 hrs
Multiplier for com_ec Queue         : 0.50
Charge before Computer Factor       : 88.00444 GAUs
Multiplier for computer blackforest : 0.10
Charged against Allocation          : 8.80044 GAUs
Project GAUs Allocated              : 5000.00 GAUs
Project GAUs Used, as of 12/16/04   : 1889.20 GAUs
Division GAUs 30-Day Average        : 103.3%
Division GAUs 90-Day Average        : 58.6%
How to increase the efficiency
Get a feel for the GAUs for long jobs: benchmark the application on the target machine
– Run a short but relevant test problem and measure the run time (wall-clock time) via MPI commands (function MPI_WTIME) or UNIX timing commands like time or timex (output formats are shell-dependent)
– Vary the number of processors to assess the scaling
– If the application scales poorly, avoid using a large number of processors (a waste of GAUs); instead use a smaller number with numerous restarts
– Make sure your job fits into the queue (finishes before the max. time is up)
Use compiler options, especially the optimization options
In case of programming problems: the Totalview debugger can save you days, weeks or even months
On the IBMs: compile your program with the compiler options -g -qfullpath -d
Restarts
Restart files are important for long simulations
– Queue limits are up to 6 wallclock hours (a hard limit; the job fails afterwards), so a restart then becomes necessary
– Get information on the queue limits (SCD web page) and select the job’s integration time accordingly
– Restarts are built into CAM/CCSM/WRF and must only be activated
– Restarts for other user applications probably must be programmed