Cluster Computing Basics
R D Bjornson, N J Carriero
CS Dept, Keck HPC Resource, YCGA
http://maguro.cs.yale.edu/mediawiki/index.php/Center_For_HPC_In_Biology_And_Biomedicine

Dec 17, 2015

Transcript
Page 1:

Cluster Computing Basics

R D Bjornson, N J Carriero

CS Dept, Keck HPC Resource, YCGA
http://maguro.cs.yale.edu/mediawiki/index.php/Center_For_HPC_In_Biology_And_Biomedicine

Page 2:

man: Describes how to use a command. Try: man man
help: Information about frequently used "shell" commands.
info: New and improved (?) man; may provide more details.
locate: Find the location of a file (in common system areas).
which: Use to determine which version of a program will be used by default.

Note: The user interface is hunt-and-peck, not point-and-click!

Don't Panic!
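A couple of these in action (the exact paths printed will vary by system):

```shell
man man      # the manual page for the manual reader itself (q to quit)
which ls     # prints the path of the ls that would run, e.g. /bin/ls
type ls      # a bash builtin; also reports aliases and shell functions
```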

Page 3:

Accessing Louise

• Run a program on your computer ("local") to log in to louise ("remote") over a network connection.
• The local computer must be on the Yale network:
  – A computer at Yale
  – Via VPN software
  – Via a login to a computer at Yale that allows external access, then login from there to louise.
• The login program must support the secure shell protocol.
  – Linux: ssh
  – Mac OS X: Use Terminal or X11/xterm to create a command line session (a "shell"), then ssh.
  – Windows: PuTTY + ssh, or cygwin (and then pretend you are using Linux).
• ssh [email protected]
• On first login, if prompted for a passphrase for an ssh key, just press "enter". In general, unless you know what you are doing, leave ssh-related files alone (and do not change the permissions on your home directory!).
• Running GUIs involves understanding and using X11: baked in with Linux, distributed but not installed by default with Mac OS X, and a 3rd-party add-on for Windows (e.g., cygwin).

Page 4:

Accessing Louise

• Use scp or sftp (part of the ssh program suite) to copy files from local to remote and back.

• rsync can be useful for keeping a local and remote file hierarchy in sync.

• wget will allow you to retrieve a file via a URL from the command line. Useful for fetching reference files from repository sites (ENSEMBL, NCBI, UCSC).

Page 5:

Cluster Organization

Login nodes
– Virtualized
– Light use only

Compute nodes
– Multicore, ~4GB DRAM per core. Parallel or concurrent execution is relatively easy using the cores of one node. More work to use the cores on multiple nodes. But in either case, do not assume this will happen automatically.
– Shared vs dedicated

File systems
– Cluster wide (default), accessible over network
– Local to node (direct connection)

Page 6:

Cluster Organization (Louise)

[Diagram: users ssh in to the login nodes ("Don't loiter in the lobby!"); qsub dispatches work to the compute nodes (e.g., compute-22-2).]

• 300+ users.
• 90 compute nodes for general use.
• Processor cores: 4 to 64 per compute node.

Page 7:

Resource Management

Need to explicitly allocate resources for computing:
– Interactive. For development; using interactive programs such as MATLAB®, python or R; and/or graphics-rich tools (X11 forwarding).
– Batch.

Commands:
– qsub registers a request for resources (for X11 forwarding, also use ssh -Y for the initial login):
  qsub -X -I -l nodes=1:ppn=8 -q default
  qsub FileWithOptionsAndCommands
– qstat provides information about requests:
  qstat -1 -n -u njc2
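A minimal FileWithOptionsAndCommands might look like the sketch below (the job name and the work commands are illustrative). Note that the #PBS lines are comments to bash, so the body also runs as an ordinary shell script:

```shell
#!/bin/bash
#PBS -q default           # queue, as in the qsub example above
#PBS -l nodes=1:ppn=8     # one node, eight cores
#PBS -N myAnalysis        # illustrative job name

# The batch system starts jobs in $HOME; return to where qsub was run.
cd "${PBS_O_WORKDIR:-.}"

echo "Running on $(hostname)"
# The real work goes here, e.g.:
# myProgram -i input.dat -o output.dat
```

Submit it with qsub FileWithOptionsAndCommands, then watch it with qstat.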

Page 8:

Tools

Editor (emacs vs vi and vim)

emacs makes it possible to work directly with files 10s to 100s of MB in size, explore binary files, capture shell transcripts and review them, interactively navigate the file hierarchy, review file differences, etc.

Binary vs ASCII files

file: Basic command to determine the kind of file.
od -c: Displays content byte by byte, permitting a detailed examination; useful especially when dealing with DOS/Unix/Mac OS X end-of-line conflicts or looking for file corruption. Often used in a "pipe" with head.

Btw, do not use a "wysiwyg" editor such as Word or WordPad for technical work, especially data preparation or code development.
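For example, with two small files made on the spot:

```shell
printf 'chr1\t100\n'   > unix.txt   # Unix line ending (\n)
printf 'chr1\t100\r\n' > dos.txt    # DOS line ending (\r\n)

file unix.txt dos.txt   # file distinguishes the two (flags the CRLF case)
od -c dos.txt | head    # byte-by-byte view: the stray \r is plainly visible
```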

Page 9:

Tools

ls, cd, mkdir: List directory contents, change directories, make a new directory. File hierarchy = tree of directories.
– A "path" is a series of nested directories written this way: /dir0/dir1/dir2/file.
– When you log in, you start work in your "home" directory (aka ~).
– When bash looks up a command for you, it searches all the directories listed in the PATH environment variable:
  export PATH=/my/new/program/Directory:$PATH
– Look in /usr/local/cluster/software/installation for programs of interest.
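To see the lookup machinery at work (the added directory here is hypothetical):

```shell
echo $PATH     # the colon-separated search list
which ls       # the first match found along PATH

export PATH=$HOME/myTools:$PATH   # prepend a (hypothetical) directory
echo $PATH                        # it now appears first, so it wins lookups
```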

Page 10:

Tools

head, less, tail: See a couple of lines of an ASCII file. head and tail can be used to extract a small sample, e.g. to see the format of data in the file or to create test input (but this kind of sample is generally not representative). Often used with pipes. Use less to browse files (by line number or percentage).

split: One way to cope with large files (but virtual splitting can be more efficient: split will, at least temporarily, double the amount of file space used).

awk: Swiss army knife. Can do head/tail/split and much more:
  awk 'NR%1000 == 13{print $0}' fullDataSet > sampleDataSet

python: An excellent general purpose text processing and analysis environment (increasingly popular, but perl has a large lead).
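A quick demonstration of these on a synthetic 100-line file:

```shell
seq 1 100 > data.txt   # lines containing 1 .. 100

head -n 3 data.txt     # first three lines: 1 2 3
tail -n 3 data.txt     # last three lines: 98 99 100

# The awk sampling idiom above, with a smaller modulus:
awk 'NR % 10 == 5 {print $0}' data.txt > sample.txt
wc -l < sample.txt     # 10 lines: 5, 15, ..., 95
```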

Page 11:

Tools: bash scripting, redirection and pipes

When you log into a computer you are connected to a program. This program accepts the text you type and does “something” with it. If, for example, you type “ls”, the program first determines that “ls” is not something that it directly understands, so it next looks for another program on the computer called “ls” in one of the directories in PATH. If it finds it, it runs that program on your behalf and then reports the output. If it does not find it, it reports an error to that effect.

Programs of this class are generally referred to as "command shells". It should be clear that the shell plays a critical role in the use of a cluster computer, and yet most users give the shell little or no thought. This generally comes back to haunt them in the form of subtle bugs that they are ill equipped to diagnose and correct, as well as missed opportunities to streamline workflows.

Page 12:

Tools: bash

Consider a sequence of commands given to the bash shell (the default shell):

gunzip data.gz
awk '/chr13/{print $0}' data > chr13Records
gzip data
myProgram -i chr13Records -o chr13Filtered
rm chr13Records
sort -k 2,2n < chr13Filtered > chr13Sorted
rm chr13Filtered

Note: stdin, stdout, stderr

Page 13:

Tools: bash

An alternative using bash pipes ("|"):

gunzip -c data.gz | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted

Three advantages:
– Less file system IO (extremely important in a cluster setting)
– Less clean up (an issue when this sort of processing is done 100s or 1000s of times)
– Better use of multicore machines (gunzip, awk, and myProgram can run concurrently)

Page 14:

Tools: bash

Now suppose we have 100 data sets: dataSet00.gz ... dataSet99.gz.

A few notes about file naming:
• When working with a large number of files, it is easy to lose track of files or accidentally overwrite some, so choose a clear and informative scheme and stick to it. If >> 1000, use additional levels of directories.
• 0- vs 1-based indexing is a subtle point that you need to get comfortable with (you don't have to use it yourself, but you will run into it sooner or later).
• Padding with leading 0's compensates for dumb file sorting.

How can we easily process all of these sets?

Page 15:

Tools: bash

for f in $(ls dataSet*.gz)
do
  gunzip -c $f | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f/.gz/}
done

Note: You can use an editor to create a file that contains a complex command or a command sequence and then have bash execute that file as if you had typed it in directly: source CommandFile

You can also turn that file (a "script") into something that you can run like any other program.
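The two steps for that, sketched with a hypothetical script name (the loop body is a simplified version of the one above, without myProgram):

```shell
# 1. Put the commands in a file, starting with a "shebang" line naming bash:
cat > extract13.sh <<'EOF'
#!/bin/bash
for f in dataSet*.gz
do
  gunzip -c "$f" | awk '/chr13/{print $0}' > "chr13_${f/.gz/}"
done
EOF

# 2. Mark it executable; it can now be run like any other program:
chmod +x extract13.sh
./extract13.sh
```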

Page 16:

Parallelism

That may take a while; how can we use multiple processors to do it faster?

Simple queue:

1. Produce a list of tasks to be executed (essentially the same loop as before, modified to display the commands to be executed rather than actually execute them):

for f in $(ls dataSet*.gz)
do
  echo "cd $(pwd) && ( gunzip -c $f | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f/.gz/} ) > ${f}.out 2> ${f}.err"
done > Tasks

2. Create a batch script that directs the resource manager to allocate compute nodes and then uses the allocated nodes to work through the list of tasks (the script can be piped ("|") to qsub):

sqPBS.py default 4.6 njc2 dataExtraction Tasks

3. Check output files and status information (Simple Queue collects a great deal).
4. sqPBS.py is installed as /usr/local/cluster/software/installation/SimpleQueue/sqPBS.py

Page 17:

[Diagram: an example task list consumed by the Simple Queue workers, each line one independent task:]

cd ... && blast ds 00
cd ... && blast ds 01
cd ... && blast ds 02
cd ... && blast ds 03
cd ... && blast ds 04
cd ... && blast ds 05
cd ... && blast ds 06

Page 18:

Aside: Random Number Generation

If you run a code that depends on random numbers, you must take care to ensure it does what you expect when you run it several times, perhaps concurrently on different nodes.

On the one hand, in general you will want each instance to see different random numbers. This may not happen by default.

On the other, you would like to be able to reproduce your results. Different but not too different!
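One common compromise: record a base seed, then give each instance base + task index; the instances differ from each other, but every run can be replayed. A sketch, with awk standing in for a program that accepts a seed (the seed-handling convention of your real program will differ):

```shell
BASE_SEED=12345                # record this alongside your results
for TASK in 0 1 2
do
  SEED=$((BASE_SEED + TASK))   # distinct per task, reproducible overall
  awk -v s=$SEED 'BEGIN { srand(s); printf "task seed %d: %.4f\n", s, rand() }'
done
```

Rerunning with the same BASE_SEED regenerates exactly the same three streams.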

Page 19:

Parallelism: Pre-packaged

Thread based: Fairly common ("easy"-ish). Thread-based parallelism can only make use of the cores on one node.

Message passing based (MPI, PVM, …): Less common in bioinformatics. A message passing program can make use of the aggregate resources of many nodes.

“make” based: Illumina and one or two others. Limited to the cores of one node.

Page 20:

Parallelism: Pre-packaged

If you are using a 3rd-party program, it is important to know which kind of parallelism is used and to invoke the program appropriately.

If threaded:
I. Run on a dedicated node!
II. Check the docs for a number-of-threads parameter.

If message passing, you typically need to set up a special execution environment in order to run the program using the resources allocated. Unfortunately, this tends to be MPI-implementation specific and so has to be addressed on a case-by-case basis (ask RDB or NJC).

If "make"-based, invoke like this:

make -j N MakeTarget > make.out 2> make.err

where N is the number of cores to use.

Page 21:

Do It Yourself: Owner computes

It is possible to write your own parallel programs. One strategy that RDB and NJC often use:
• Imagine that you run multiple copies of a sequential version.
• At some point, the copies will enter a period of execution in which the work can be split up into independent tasks. Add a check to decide which copy "owns" (and should execute) a given task; all other copies will skip this task.
• Each copy records the tasks it did. When it exits the period of execution that was split up, it exchanges with all other copies the results of the tasks it did. At this point all the copies know all the results and will continue to execute as if they had each done all of the work themselves.

The devil is in the details, especially the mechanisms used to settle ownership and to exchange task results. Ask us for help; just keep in mind that this kind of parallelism is an option and need not be terribly complex.
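A toy shell rendering of the ownership check (COPY_ID and NCOPIES would be supplied by whatever launches the copies; taking the task index mod NCOPIES is just one simple ownership rule, and the result-exchange step is omitted):

```shell
COPY_ID=${COPY_ID:-0}    # this copy's rank: 0 .. NCOPIES-1
NCOPIES=${NCOPIES:-3}    # how many copies are running

i=0
for task in taskA taskB taskC taskD taskE taskF
do
  if [ $((i % NCOPIES)) -eq "$COPY_ID" ]; then
    echo "copy $COPY_ID owns $task"   # owned: this copy does the work
  fi                                  # not owned: skip, another copy has it
  i=$((i + 1))
done
```

Running the same script with COPY_ID=0, 1, and 2 partitions the six tasks cleanly, with no communication needed to decide who does what.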

Page 22:

Software as an Experimental System

Start with "small" input sets and/or run parameters and systematically alter these to study how CPU time, memory use and IO activity vary from run to run.

Non-invasive tools:

top: May need a separate login to the allocated node (use intra-cluster ssh).

The time command:

/usr/bin/time -v prog a0 a1 a2 > outFile 2> errFile

Output from time will be appended to "errFile". Note: use the full path; this is an instance where it is important to understand how the shell works (bash has a built-in time that would otherwise be used instead, and it reports far less detail).

Page 23:

Software as an Experimental System

If you are in a position to modify code, you can get much more accurate and detailed information.

Ditto with profiling:
– Compile-time option plus post-processing for C, C++, Fortran, …
– Available as a runtime facility in various scripting systems (python, perl, ruby).
– Activating profiling often significantly increases run time, placing a premium on well designed small test cases.
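For instance, python's bundled profiler needs no compile step (the hot-spot script here is made up):

```shell
cat > hotspot.py <<'EOF'
def work():
    return sum(i * i for i in range(100000))

for _ in range(50):
    work()
EOF

# Run under the profiler; the report (sorted by cumulative time) goes to stdout:
python3 -m cProfile -s cumtime hotspot.py | head -n 12
```

The per-function call counts and times point directly at where a small test case spends its effort.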

Page 24:

Scaling Considerations

Consider the time (in arbitrary "operation" units) to process N records, if doing:
– A record by record transform => Time(N)
– An all-to-all comparison => Time(N^2)
– An exploration of subsets => Time(2^N)
– An exploration of orderings => Time(N!)

One naturally tends to focus on run time, but memory and IO (amount as well as rate) matter too.

Page 25:

Scaling Considerations

What N corresponds to about 1 CPU second?
– Time(N): N => 1,000,000,000
– Time(N^2): N => 30,000
– Time(2^N): N => 30
– Time(N!): N => 13

Which model applies clearly matters!

Page 26:

Scaling Considerations

It matters when determining how big a problem is feasible. Suppose we double the input size:
– Time(2 * 1,000,000,000) => ~2 s
– Time((2 * 30,000)^2) => ~4 s
– Time(2^(2 * 30)) => ~1,000,000,000 s (> 30 years)
– Time((2 * 13)!) => ~10^16 s (roughly a billion years)

Page 27:

Scaling Considerations

It matters when verifying code behavior. If you have a code that you believe follows a Time(N) model, but empirically behaves like Time(N2), then you may have a bug.

For example, code that maintains a list of values can easily degenerate to Time(N2) if one is careless with the operations that maintain the list.

Page 28:

Other Performance Considerations

Memory hierarchy:
Do as much as you can with one record before moving on to the next.

Physical vs virtual memory:
When chunking work, size to fit in physical memory.

Local vs remote IO:
If you cannot eliminate temporary IO via bash pipes or named pipes, at least write to a local file system (but clean up!).

Bulk IO vs character IO:
Mostly done for you, but avoid IO operations that read or write one byte or character at a time.

Data IO vs metadata operations:
Metadata operations are much more expensive than normal data IO. Avoid them. E.g., don't use a series of specially named empty files to indicate progress; write to a log file instead.
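As an example of eliminating temporary IO, a named pipe moves data between a producer and a consumer without ever landing on disk (file names here are illustrative):

```shell
seq 1 100000 > bigInput    # stand-in for a real data file

mkfifo stage               # looks like a file, but stores nothing on disk
gzip -c < bigInput > stage &     # producer: compress into the pipe, in background
gunzip -c < stage | wc -l        # consumer: 100000 lines flow straight through

rm stage bigInput          # a fifo is still a directory entry; clean up
```

Unlike an intermediate file, the fifo never consumes file system space or bandwidth, and the two processes run concurrently.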