High Performance Statistical Computing with Applications in the Social Sciences
Micah Altman, Senior Research Scientist
"Introduction to the RCE" by Earl Robert Kinney, Manager, Research Computing Environment
Institute for Quantitative Social Science, Harvard University
Goals for Today
Analysis
- Describe performance goals
- Identify resource use patterns
- Identify resource bottlenecks
- Identify performance hot-spots
- Select problem decomposition
Application
- Connect to the RCE
- Use the RCE to analyze larger data sets
- Use the RCE to run interactive analyses more quickly
- Use the RCE to run large numbers of analyses independently
M. Altman & B. Kinney, High Perf. Stat. Computing (v. 9/10/11)
[Source: Wikimedia Commons]
Organization of this Workshop
- Motivation
- Principles
- Introduction to RCE
- Measuring Resource Use
- Scaling Up
- Tuning Up
- Scaling Out (Parallelization)
- Additional Resources
Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, and algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to the system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat the analysis after making any changes
FREE! With every first class!
Coffee! Chocolate!! Consulting!!! Time off for good behavior!!!!
And now, a word from our sponsor…
IQSS (and affiliates) offer you support across all stages of your quantitative research:
- Research design, including: design of surveys, selection of statistical methods
- Primary and secondary data collection, including: the collection of geospatial and survey data
- Data management, including: storage, cataloging, permanent archiving, and distribution
- Data analysis, including: survey consulting, statistical software training, GIS consulting, high performance research computing
Types of Performance Goals
- Task completion time – wait time to finish
- Efficiency – resource use for the task
- Throughput – work done by the system overall
- Latency – delay before response
- Responsiveness – perception of response
- Reliability – probability that the task/system will fail during a time interval
"If you don't know where you're going, any road will take you." – Proverb
"If you come to a fork in the road, take it." – Yogi Berra
High Performance Statistical Computing: Principles
Performance Goals – Rules of Thumb
- Completion time: work(i)/resource(i)
- Throughput: maximize(work/resource) for all jobs
- Latency: time elapsed before first response to input
- Real-time: complete the task within a fixed interval
"Responsiveness"
- Perceived latency
- Task completion time
- Task progress indicators
Rules of thumb:
- Users of interactive software want responsiveness
- Users of batch jobs want small completion times
- Systems administrators want maximum throughput and reliability
Size of Factors Affecting Performance
Run Time for a Large Instance (n=1000)
If the runtime for solving a small instance of a problem (n=10) on a single system is one minute, how long will it take to solve a larger instance with n=1000?

NP-Hard (worst case): 10^292 years
Very inefficient algorithm, O(N^3): 1.6 years
Inefficient algorithm, O(N^2): 16 hours
Very poor memory access patterns: 11 hours
Un-optimized code: 67 minutes
Optimized code: 7 minutes
Local multiprocessing: 2 minutes
Fully parallel / full cluster: 4 seconds
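The extrapolation behind the growth-rate rows is simple arithmetic; a minimal Python sketch (the helper name `projected_minutes` is ours, and the slide's figures also fold in constant-factor effects such as code optimization, so exact agreement isn't expected):

```python
def projected_minutes(base_minutes, n_small, n_large, exponent):
    """Extrapolate runtime, assuming cost grows as n**exponent."""
    return base_minutes * (n_large / n_small) ** exponent

# 1 minute at n=10; the problem grows 100x:
quadratic = projected_minutes(1, 10, 1000, 2)  # O(N^2): 10,000 minutes (~7 days)
cubic = projected_minutes(1, 10, 1000, 3)      # O(N^3): 1,000,000 minutes (~1.9 years)
print(quadratic, cubic)
```

The point is that the exponent dominates everything else: every extra power of N multiplies the projected time by another factor of 100 here.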
Problem Complexity Classes
- Problem complexity class: the set of problems that can be solved in O(f(n)) for some f
- More general than algorithmic complexity – encompasses all possible algorithms for solving the given problem
- A polynomial-time algorithm is necessary for large problem instances
[Diagram: decision problems split into decidable and undecidable; nested complexity classes EXPSPACE ⊇ EXPTIME ⊇ PSPACE ⊇ NP, co-NP ⊇ P = BPP(?); BQP; NP-complete]
Some Problems Are HARD
- Traveling Salesperson Problem (weighted Hamiltonian cycle): plot a route through N locations, visiting each once, that minimizes cost
- NP-Hard: worst-case instances require exponential time for an optimal, certain solution
- NP-Complete: equivalent to a large class of hard problems
Source: Applegate, Bixby, Chvátal, and Cook (1998)
How to "Solve" the Unsolvable
- Think small: use only a small number of cities; aggregate to regions and treat them as quasi-cities
- Restrict the problem: Euclidean distances are easier than travel cost
- Solve a different problem: minimum spanning tree
- Approximate the solution: for Euclidean distances, there is an algorithm based on the minimum spanning tree that is at most 50% longer
- Randomize: can a randomized algorithm find a solution with probability p? (No one knows… probably not)
- Be lucky: maybe the "average" problem isn't that hard?
- Heuristics: apply simulated annealing (etc.), cross fingers
How to Recognize Hard Problems
- Is the problem routinely solved by existing systems?
- Are efficient algorithms known?
- Does it appear in lists of hard problems?
- Is the problem universal? (Any computing problem, sufficiently generalized, is hard [Papadimitriou 1994])
Items in a fixed range, no duplicates:
    inlist = logical(length = max - min + 1)
    for (i in 1:length(items)) { inlist[items[i]] = TRUE }
    for (i in min:max) { if (inlist[i]) dowork(i) }
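The range-restricted membership idea above translates directly to Python; a sketch with a hypothetical `dowork` callback (the `- lo` offset handles ranges that don't start at 1):

```python
def each_present(items, lo, hi, dowork):
    """O(n + range): mark which values in [lo, hi] occur, then
    visit each present value, in sorted order, via dowork."""
    inlist = [False] * (hi - lo + 1)
    for v in items:
        inlist[v - lo] = True           # constant-time "insert"
    for v in range(lo, hi + 1):
        if inlist[v - lo]:
            dowork(v)

seen = []
each_present([7, 3, 5], 1, 10, seen.append)
print(seen)  # [3, 5, 7]
```

Note the side effect: the items come back in sorted order without any comparison sort, at the cost of one boolean per possible value in the range.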
How fast is this?
Algorithm Recurse_sort(array L, i = 0, j = length(L)-1)
    if L[j] < L[i] then L[i] ↔ L[j]
    if j - i > 1 then
        t = (j - i + 1)/3
        Recurse_sort(L, i,   j-t)
        Recurse_sort(L, i+t, j)
        Recurse_sort(L, i,   j-t)
    return L
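Answering the slide's question: this is stooge sort. The recurrence T(n) = 3·T(2n/3) + O(1) solves to roughly O(n^2.71) — asymptotically worse than even a naive O(n^2) sort. A runnable Python transcription:

```python
def recurse_sort(L, i=0, j=None):
    """Stooge sort: sort L[i..j] in place via three overlapping 2/3-size recursions."""
    if j is None:
        j = len(L) - 1
    if L[j] < L[i]:
        L[i], L[j] = L[j], L[i]
    if j - i > 1:
        t = (j - i + 1) // 3
        recurse_sort(L, i, j - t)       # first two thirds
        recurse_sort(L, i + t, j)       # last two thirds
        recurse_sort(L, i, j - t)       # first two thirds again
    return L

print(recurse_sort([5, 2, 9, 1, 3]))  # [1, 2, 3, 5, 9]
```

It does sort correctly — which is the slide's broader point: correct code can still have terrible resource complexity.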
Answers Matter
- Before optimization, verify the answer
- Right can mean "right enough," if well-defined
- Correct code may have different performance characteristics than incorrect code
- Returning a wrong answer can always be done
Big, Fast, Cheap – Pick 2
- Latency increases with each step down the hierarchy
- Storage increases
- Throughput decreases (except with some offline storage)

Registers (<1 KB)
Cache (1 MB)
RAM (10s of gigabytes)
Local storage (10s of terabytes)
Online storage (100s of petabytes)
Offline storage (10s of exabytes)
Reading One Byte: x -= m[1,3]
- CPU: 8 bytes loaded into a register
- Cache: 256-byte cache line
- RAM: 4 KB page
- Disk: 8 KB block from NFS (networked file system)
General Performance Implications of Architecture
- Talking to external devices can cause waits (latency)
- Information transmitted to the CPU is limited by the bus (throughput)
- In practice, expect 80% of theoretical data-path bandwidth at best
- Some optimizations are highly specific to architectural details
- Hidden parallelism at low levels
- Information travels in chunks (at least bus size)
- Complexity makes theoretical performance analysis difficult – use benchmarks
From Principles to Practice
Practice = Principles * Optimization Goals * Problem Type * Computing Environment
Optimization goals:
- Throughput
- Latency
- Reliability
- Scaling up
- Scaling out
Problem decomposition:
- Independent data
- Independent calculations
- Coupled calculations
Principled Preparation Checklist
- Verify that your problem is tractable: substitute an easier problem; restrict or limit the problem; be lucky or clever
- Establish performance goals
- Identify possible algorithms: What is their resource complexity? Are better algorithms known?
- Identify potential system characteristics: communications costs; systems resources
High Performance Statistical Computing: Introduction to RCE

- Full virtual desktop environment – connect anywhere
- Many research software packages available
- Persistent session – connect anytime

Interactive Nodes
- For large interactive jobs
- Large amounts of memory available on demand
- Stata, Matlab, Mathematica

Batch Processing
- Run hundreds of jobs at once
- Optimized for non-interactive, independent work
- Easy to run from your virtual desktop
Why use the RCE?
For research:
- An environment customized for quantitative social science research
- A wide variety of research software packages are available
For convenience:
- The RCE enables you to access a research desktop from almost any computer
- Sessions are persistent – disconnect from your office, reconnect from home
- File storage is central; never worry about which computer has your files
For resources:
- Large analysis jobs are offloaded to high-powered servers
- Large resource pools: 800 processors, 3.3 TB of memory, 40 TB of disk storage
- Regularly updated software
For collaboration:
- Offers an ideal environment for collaborative research projects
- Share project files, desktops, software
For reliability:
- System performance and availability are constantly monitored
- Research files are regularly backed up and stored securely
- IQSS has full-time staff dedicated to supporting the RCE
Systems Resource Use
- Benchmarks
- Timing
- System resource monitoring
- System resource limits
High Performance Statistical Computing: Analyzing Resource Use
Benchmarks
- What patterns of usage are likely to occur? What are the 80% cases?
- Are there 10% cases that have unusual patterns of data access, or unusual input?
- Can you construct a plausible worst case?
- Parameterize benchmarks: parameterize problem size; vary order of magnitude
- Create benchmarks based on real cases: use real problems for full benchmarking; miniaturize real problems for quick tests
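A minimal parameterized benchmark harness in Python (the names `benchmark` and `sum_squares` are ours; best-of-reps timing damps noise from other activity on the system):

```python
import time

def benchmark(task, sizes, reps=3):
    """Time task(n) at each problem size; keep the best of `reps` runs
    to damp noise from other activity on the system."""
    results = {}
    for n in sizes:
        best = float("inf")
        for _ in range(reps):
            start = time.perf_counter()
            task(n)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results

def sum_squares(n):                      # miniature stand-in for a real analysis
    return sum(i * i for i in range(n))

# Vary size by orders of magnitude to expose the scaling curve
for n, t in benchmark(sum_squares, [10**3, 10**4, 10**5]).items():
    print(f"n={n:>7}  best of 3: {t:.6f}s")
```

Plotting time against size on a log-log scale then makes the growth exponent visible directly.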
Common Benchmarks
- Artificial benchmarks
- Simple "unit" benchmarks
- Real application + random data
- Real application + real data
- Real application + worst-case data
- Mix of applications
Timing
Why measure timings:
- Direct or indirect measure of performance
- Establish a baseline for changes
- Empirical measure of scaling
Limitations:
- Timers are often imprecise for brief events
- Other activity on the system adds "noise"
- Many tools aggregate all phases of execution
- Many tools aggregate all areas of resource use
- CPU timings may exclude system resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Alternative: Queuing Models
- A formalist alternative to benchmarks; can be useful for capacity planning
- Model services as a network of queues: different classes of "customers"; resources with different delay characteristics; transition probabilities
- Distribution of "service events": Poisson events are discrete, independent, and memoryless; the number of events is Poisson distributed; interarrival times are exponentially distributed
- Little's law: length of queue = arrival rate * time in queue
- Limitations: heroic assumptions are often required; state-space explosion; only the simplest models are solvable in closed form
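Little's law can be checked exactly on any finite trace in which every job departs: the time-average number in system obtained by integrating N(t) must equal (n/T)·W. A sketch for a FIFO single-server queue (function and variable names are ours):

```python
def little_check(arrivals, services):
    """Run a FIFO single-server queue, then compute the time-average number
    in system (L) two ways: by event sweep, and as arrival_rate * W."""
    finish, departures = 0.0, []
    for a, s in zip(arrivals, services):
        start = max(a, finish)              # wait if the server is busy
        finish = start + s
        departures.append(finish)
    T = departures[-1]                      # every job has departed by T
    lam = len(arrivals) / T                 # average arrival rate
    W = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)
    events = sorted([(a, 1) for a in arrivals] + [(d, -1) for d in departures])
    area = last_t = 0.0
    in_system = 0
    for t, delta in events:
        area += in_system * (t - last_t)    # integrate N(t) piecewise
        in_system += delta
        last_t = t
    return area / T, lam * W                # the two must agree

L, lamW = little_check([0, 1, 2, 3], [2, 2, 2, 2])
print(L, lamW)  # 1.75 1.75
```

The agreement is an identity over a completed horizon, which is why Little's law needs so few assumptions compared to the rest of queueing theory.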
Source: Takefusa, et al. 1999
Wall-Clock Time
- Measure completion time
- Show phases of execution by inserting calls
*Optional tool, may require installation on your system
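Inserting timing calls between phases is the simplest instrumentation. A Python sketch (the helper name `report_phase` is ours; the phase bodies are stand-ins):

```python
import time

def report_phase(label, start):
    """Print elapsed wall-clock time for one phase, return a new start mark."""
    now = time.perf_counter()
    print(f"{label}: {now - start:.3f}s")
    return now

t = time.perf_counter()
data = list(range(1_000_000))            # stand-in for a data-loading phase
t = report_phase("load", t)
total = sum(x * x for x in data)         # stand-in for an analysis phase
t = report_phase("analyze", t)
```

The same pattern appears later in these slides as repeated `date()` calls in an R session.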
Interpreting Process Monitor Results
- Shows the list of processes running
- See current and accumulated CPU & memory usage
- See CPU utilization
$ gnome-system-monitor &
[Screenshot callouts: show processes; sort by number of processes waiting to use the CPU; sort processes by CPU use]
Sample Performance Curves
- Best case: linear in the size of the problem
- Nonlinearities could mean: an inefficient algorithm (case 2); a hard problem (case 3); poor data access patterns (case 4)
System Resource Monitoring
Why monitor system resources?
- Identify bottlenecks
- Identify processes using resources – may affect overall throughput and capacity
- Identify processes actively using resources – may affect performance
Limitations:
- Tools are often imprecise for brief events
- Other activity on the system adds "noise"
- Many tools aggregate all phases of execution
- Many tools aggregate all system use
- Many tools aggregate sub-resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Monitoring System Resources
- See system aggregated use and activity for memory, disk, network
- See memory use by process
- See resource use by process (varies by platform)
Linux: gnome-system-monitor; /usr/bin/time -v; sar; iostat; vmstat
OS X: Utilities -> "Activity Monitor"; /usr/bin/time -v; sar; iostat
Windows: perfmon.exe; taskmgr.exe

$ gnome-system-monitor &
$ sar -A 1 10
$ /usr/bin/time -v stata -b somefile.do
Detailed System Resource Tracing
- See system use/calls for a process as it runs
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity
$ sar -bB 1 10
System memory activity
System disk activity

Interpreting System Activity
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity
$ perfmon
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 149
Voluntary context switches: 1280
Involuntary context switches: 460
Swaps: 0
File system inputs: 0
File system outputs: 0

Interpreting Process Resource Use
- Page faults – indicate memory activity or resource contention
- Voluntary context switches – indicate waiting on I/O or memory
- Swaps – indicate a severe system memory shortage
- File I/O – indicates file activity
- If a number is always 0, it's a lie
$ /usr/bin/time -v stata -b command
[Screenshot callouts: often memory related; process disk I/O]
Symptoms of a CPU-Bound System/Problem
- CPU user+sys activity near 100% while there are active processes (if # of processes > # of CPUs)
- The performance curve for your problem is continuous
This is usually good:
- CPU is the most expensive resource
- You can trust code profiling reports
- More likely to see gains from parallelization
However, if CPU %sys is high, suspect inefficient use of system calls, or borderline I/O or memory bottlenecks
Symptoms of a Resource Bottleneck
Memory bottlenecks:
- Severe: processes in the swap queue (or waiting on swap); lots of swap space in use (see swap -m); swapping activity; free memory low
- Moderate: high context switches + high page (validity) faults + active processes with memory >> resident memory
I/O bottlenecks:
- Moderate: high %sys activity in CPU; high number of system calls and interrupts
- Severe: I/O rate high; context switches, wait on I/O, or processes sleeping on I/O; physical disk activity high
Performance curve:
- Discontinuous regions of accelerated performance decline
Tune Against Bottlenecks
- Typically, a single resource will be the bottleneck point: CPU; memory; I/O (graphics, network, disk)
- If you don't address the bottleneck, optimizations elsewhere won't matter
- Bottlenecks may depend on the usage scenario and phase of operation
- Fixing one bottleneck may reveal others
- Don't expect speedup of the entire program to be proportional to the code you just tuned!
- Programs interact; try to profile on a quiet system first
Resource Analysis: Checklist
- Identify benchmarks: small instances of your problem; can vary size
- Target an isolated system; minimize other activity
- Time benchmarks at various sizes
- Monitor systems resources
- Look for non-linearities in the performance curve
- Look for bottlenecks
Lab: Analyzing Resource Use
In this lab, we will log in to the RCE and run a simple set of benchmarks.
Scaling Up
- Addressing resource bottlenecks
- System and application limits
- Storing/accessing large datasets
- Visualizing large datasets
When to Scale Up
- If resource analysis identifies a memory bottleneck
- If resource analysis identifies an I/O bottleneck (maybe…)
- If problem size prevents the program from starting
- If the program crashes or hangs in the middle of solving large problems (maybe…)
- If planning ahead for significant usage changes: size of problem data > ~1/2 of available physical memory (RAM); change of algorithm; change of data structure
Addressing Memory Bottlenecks
Review – symptoms of a memory bottleneck:
- Discontinuity in the performance curve
- Memory size of the process increasing
- Resident memory size of the process relatively large
- System activity shows memory activity
Principles of addressing memory bottlenecks:
- Memory hierarchy
- Locality of reference
Programming patterns:
- Add more resources
- Modify data types
- Modify data structures
- Modify algorithms
Memory Hierarchy
- Registers (<1 KB)
- Cache (1 MB)
- RAM (10s of gigabytes)
- Local storage (10s of terabytes)
- Online storage (100s of petabytes)
- Offline storage (10s of exabytes)
If a register access took a second, a tape access would take a few centuries.
Buy one, get 8092 free!
Locality of Reference
- Temporal locality: reuse the same data elements
- Spatial locality: use elements that are "near" each other in memory
What is "near"?
- For vectors and files: sequential ordering
- For matrices: either row or column ordering, depending on the language
- For complex data structures: use experimentation and analysis
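The row-versus-column effect can be demonstrated in any language; a Python sketch traversing the same nested-list "matrix" row-by-row versus column-by-column (row-order matches how each row's elements are laid out, so it touches memory sequentially). In pure Python the gap is modest; with large NumPy arrays or compiled code, poor stride order can cost an order of magnitude:

```python
import time

n = 1000
matrix = [[i * n + j for j in range(n)] for i in range(n)]  # rows are contiguous lists

start = time.perf_counter()
row_sum = sum(matrix[i][j] for i in range(n) for j in range(n))   # sequential access
row_time = time.perf_counter() - start

start = time.perf_counter()
col_sum = sum(matrix[i][j] for j in range(n) for i in range(n))   # stride-n access
col_time = time.perf_counter() - start

print(row_sum == col_sum, f"row: {row_time:.3f}s  col: {col_time:.3f}s")
```

Both traversals compute the same answer; only the access pattern — and therefore the cache behavior — differs.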
Row-Major Order
Adding More Resources – "$$$" Optimization
- Buy more memory, or… use the RCE to request a larger share
- This is effective if the local set size < share size
System and Application Resource Limits
Limits imposed by the system or application:
Virtual memory:
- Logical memory space for a process
- Virtual memory limits the maximum size of memory requested
- Can prevent a program from starting, or from loading large data
Physical memory:
- Physical RAM installed in the system
- Usually smaller than VM, but not always
- Maximum efficient local set
Resident size limits:
- Affect the maximum efficient local set, though not as severely as physical limits
Limits in Linux and OS X
Where limits are set:
- Set at bootup
- Set by the system at login – group/user-level total memory limits
- Set in the shell at process creation – request a new limit (up to the user maximum)
- Set in code via setrlimit
- Set in the application
Know your limits:
- Linux/OS X: ulimit -a
- R: none for Linux
- Stata: query memory
Limits on 32- vs. 64-bit systems:
- A 32-bit OS has a limit of 4 GB for virtual & physical memory
- 64-bit OS: no practical limit on virtual memory; physical memory still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
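On Linux/OS X the same limits can also be queried and adjusted programmatically; a sketch using Python's stdlib resource module (Unix only):

```python
import resource

# Address-space (virtual memory) limit for this process: (soft, hard)
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("virtual memory limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes")

# A process may lower its own soft limit (up to the hard limit), e.g.:
# resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))
```

This mirrors the setrlimit bullet above: the soft limit is what actually stops an allocation, and only the hard limit requires privileges to raise.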
Limits in Windows Systems
Where limits are set:
- Limits implied by configuration at boot
- Virtual memory typically depends on configured paging space on disk + pagefile
- R: memory.limit()
Limits on 32- vs. 64-bit systems:
- Most 32-bit Windows OSes have a limit of 3 GB of physical memory: 32-bit addressing allows 4 GB, but 1 GB is reserved for memory-mapped hardware, so only 3 GB is left over in most Windows configurations
- 64-bit OS: no practical limit on virtual memory (8 TB); physical memory still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
- Some Windows applications are 32-bit versions, so they are still limited to 4 GB of virtual memory
Basic Memory Management in Statistical Software

                            Matlab   R                                              Stata
Memory limit                ---      memory.size() [Windows only]                   set memory
Remove objects              clear    rm()                                           clear
Shrink data types           ---      as.integer(real_val), as.factor(string_val)    compress
Measure data size           ---      object.size(), gc()                            memory
Order for virtual memory    pack     gc()                                           set virtual
Data Structures and Memory
- In some circumstances, few data points * lots of dimensions = huge matrices
- Alternate representations can help: sparse matrix; network/graph; linked list; hash
- Matlab and R offer these; Stata doesn't offer much control over this
Algorithmic Approaches for Memory Bottlenecks
Software default memory models:
- R: in-memory by default; can work around this through DBI interfaces and by chunking data
- SAS/SPSS: cross-tabs, descriptive stats, and correlations can be done without large memory; GLM etc. is done in memory
- Numerical Python / Perl Data Language / LAPACK libraries: for large matrix manipulation
- MySQL/Postgres: most analyses are done streaming, with a small memory footprint
Use memory-efficient options for MLE and optimization (e.g., the conjugate gradient method for optim(); avoid Hessians)
Use memory-friendly techniques:
- Blocking
- Divide and conquer
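Data blocking in miniature: a Python sketch (the names `chunks` and `blocked_mean` are ours) that computes a grand mean from fixed-size blocks without ever holding the full dataset in memory:

```python
def chunks(stream, size):
    """Yield successive blocks of at most `size` items from any iterable."""
    block = []
    for x in stream:
        block.append(x)
        if len(block) == size:
            yield block
            block = []
    if block:
        yield block

def blocked_mean(stream, size=1000):
    """One pass, O(size) memory: accumulate count and sum block by block."""
    n = total = 0
    for block in chunks(stream, size):
        n += len(block)
        total += sum(block)
    return total / n

print(blocked_mean(iter(range(1, 10_001)), 256))  # 5000.5
```

The same skeleton works for any statistic that decomposes over blocks (counts, sums, cross-products); the one-pass variance algorithms on the next slide are the classic instance.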
Example: Computing Variance
Source: Wikipedia
Two-pass algorithm:

    def two_pass_variance(data):
        n = 0
        sum1 = 0
        for x in data:
            n = n + 1
            sum1 = sum1 + x
        mean = sum1 / n
        sum2 = 0
        for x in data:
            sum2 = sum2 + (x - mean)**2
        variance = sum2 / (n - 1)
        return variance

One-pass, numerically unstable:

    def naive_variance(data):
        n = 0
        Sum = 0
        Sum_sqr = 0
        for x in data:
            n = n + 1
            Sum = Sum + x
            Sum_sqr = Sum_sqr + x*x
        mean = Sum / n
        variance = (Sum_sqr - Sum*mean) / (n - 1)
        return variance

One-pass, accurate (Welford's method):

    def online_variance(data):
        n = 0
        mean = 0
        M2 = 0
        for x in data:
            n = n + 1
            delta = x - mean
            mean = mean + delta/n
            M2 = M2 + delta*(x - mean)
        variance_n = M2/n        # population variance
        variance = M2/(n - 1)    # sample variance
        return variance
Storing Large Data: Compression
Advantages:
- Reduces size on disk
- Can decrease load times (disk is slow; the CPU can keep up with decompression)
Disadvantages:
- Generally does not change size in working memory (compressed data structures are rare)
- Less efficient for data that will be updated (need to write a separate copy of the file)
Tools:
- Separate stage: gzip/zip
- Pipes avoid creating separate copies of extracted files: gzip/gunzip -c
- Access compressed data directly in R: gzfile()
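The same streaming idea is available in Python's stdlib: write a gzip file once, then read it back line-by-line, decompressing on the fly, without materializing an uncompressed copy on disk (analogous to `gunzip -c | …`, or `gzfile()` in R). The file name and contents here are invented for the sketch:

```python
import gzip, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt.gz")

# Write compressed text directly -- an uncompressed copy never hits the disk
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"{i}\t{i * i}\n")

# Stream it back, decompressing on the fly
total = 0
with gzip.open(path, "rt") as f:
    for line in f:
        total += int(line.split("\t")[1])

print(total)  # 332833500, the sum of squares 0..999
```

Because decompression happens inside the read loop, peak memory stays at one line regardless of the file's uncompressed size.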
Storing Large Data: Data Formats
Making the most of formats:
- Use storage features: binary storage, internal compression
- Choose efficient data types: integers, numeric, factor; avoid character and custom types
- Partition sets of variables into different data frames by hand
Key-value based data storage:
- Details: B-tree based or hash file (e.g., gdbm, Berkeley DB); key-value pairs used for storage/quick retrieval; most efficient for multiple readers of arbitrary values; less efficient for frequent updates and continuous reads
- Tools: see the R task view
Scientific data formats:
- Details: common formats are HDF and NetCDF; built for complex hierarchical data structures; efficient storage/updates of entire variables from within datasets (uses B-trees)
- Tools: require external support libraries; R packages for ncdf/cdf/hdf
Large Data: Analysis
Virtual memory:
- Configure the system with lots of virtual memory – swap space (not currently an RCE option)
- Use R/Stata settings for better VM management; locality of reference is critical
Subsampling:
- Good for means and moments
- May miss outliers; stratify on outliers if possible
Data reduction with external tools:
- Partition data horizontally/vertically
- Reduce measurements (e.g., code text)
- Convert data structures
Data streaming / data blocking:
- Can the algorithm compute in one pass?
- Can the algorithm compute in blocks of arbitrary size?
Storing Large Data: Databases
Systematic storage:
- Studies organized into databases
- A database partitioned into multiple tables
- Tables contain sets of variables
- Data from multiple tables join'ed on
Layout of data in tables:
- Appropriate normalization
SQL statements:
- SQL is declarative, not procedural
- Stick to the relational model as much as possible (e.g., no functions or ordering in selection rules – only in final results)
- Put the database query optimizer to work for you
Database structures affecting performance:
- Indexes: optimize ad-hoc row-wise retrieval
- Cursors: optimize retrieving a subset of data incrementally
- Materialized views: a predefined subset of data for retrieval
Performance tuning:
- Data types: native types are fastest; variable-length character fields and large objects can affect performance
- Reads vs. updates: reads do not require exclusive locks (generally); writes require exclusive locks – if there are lots of writes, use row-level locks; indexes help reads, hurt writes
- Some summary statistics can be computed in SQL, regardless of main-memory limitations in R
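A concrete instance of that last point, using Python's stdlib sqlite3 (the table and column names are invented for the sketch): the database streams over rows to compute the aggregates, so the client never needs the full table in memory.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (respondent INTEGER, income REAL)")
con.executemany("INSERT INTO survey VALUES (?, ?)",
                [(i, 1000.0 * i) for i in range(1, 101)])

# Aggregation happens inside the database engine, one row at a time
row = con.execute(
    "SELECT COUNT(*), AVG(income), MIN(income), MAX(income) FROM survey"
).fetchone()
print(row)  # (100, 50500.0, 1000.0, 100000.0)
```

The same SELECT works unchanged against an on-disk table far larger than RAM.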
Optimizations for Databases: Gotchas
- Does not save disk space (often more space is needed for indices)
- Difficult to index ad-hoc retrieval of variables – efficiency depends on a fixed table decomposition
- Limited data types & numeric precision
- Can be much slower than storing in R (etc.) for data that would fit in main memory
- If doing analysis in some other program, you still need to manage the size of extracted subsets to avoid memory limitations
- Overhead for access and indexing: indexing always slows down updates to the database
Large Data Analysis: Graphics
Types of graphical issues – areal plots:
- Display is independent of the number of observations
- Use logical zooming for large numbers of categories
- Use redlining
- Examples: histograms, mosaic plots, fluctuation diagrams
Scaling Up: Checklist
- Perform resource usage analysis first
- Pay attention to locality of reference and the memory hierarchy
- Examine resource limits
- Consider adding memory resources
- Consider data types and program memory management
- Consider data structures
- Modify algorithms: blocking, divide and conquer
- Consider a database / data-streaming algorithms
- Use visualization techniques suitable for large data
High Performance Statistical Computing: Tuning Up
- Get the answer right (or as right as you need), first
- Use appropriate algorithms
- Identify use-case-based benchmarks
- Look for system bottlenecks
- Examine processes
- Profile applications
- Tune small, expensive regions of code
Profiling Methodologies – Wall Clock
- Usually instrumented ad hoc
- Very easy to do for very small programs
- Very coarse grained
- Selects on the dependent variable
- Stata: timer; R: trace(glm,

Profiling Methodologies – Statistical (Sampling) Profilers
- Automatically inserts an interrupt handler into the code
- A timer fires interrupts regularly
- The interrupt handler builds up a statistical picture of runtime behavior
- Uses the "call graph" to determine where the program is spending time
- Useful for identifying CPU-intensive functions
Limitations:
- Does not capture bottlenecks that cause a process to sleep/block/swap (i.e., waiting for a file open or network read to complete, or swapping out large chunks of memory)
- Somewhat imprecise – needs longer runtimes (10s of seconds should be fine)
- Needs representative input data
- May be limited in reporting details of the calling sequence
- Usually limited to function calls; won't show details on blocks of code
- Heisenbugs
User time (seconds): 357.04
System time (seconds): 3.51
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:23.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 907
Minor (reclaiming a frame) page faults: 229112
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Example (step 4: single process timing)
- Time –v output shows process is CPU bound. -No discrepancy between wall-clock and CPU time.
I used ‘time –v R …’ to look at a smaller example:
Example (step 4.1: mini-profile)
I cooked a short R test program, using a subset of the original data, timing calls, and a simpler stat analysis:

% /usr/bin/time -v R --no-save < R.example
> date()
[1] "Wed Dec 26 14:47:22 2001"
> y<-read.table("da6",header=TRUE,row.names=,sep="\t")
> date()
[1] "Wed Dec 26 14:53:44 2001"
> summary(y$CLIDNO)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     11    5807    9999    8119    9999    9999
> date()
[1] "Wed Dec 26 14:53:44 2001"
> q()
The internal timing calls alone show that 99%+ of the time was spent in read.table()
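A common complementary fix for a read.table() bottleneck (not shown on the slide; the file, column names, and types here are hypothetical) is to declare colClasses up front so R can skip its type-guessing pass over the data:

```r
# Sketch: pre-declaring column classes speeds up read.table().
# Build a small tab-delimited test file to read back.
tmp <- tempfile()
write.table(data.frame(CLIDNO = 1:100000, X = runif(100000)),
            tmp, sep = "\t", row.names = FALSE)

# Default: R inspects the data to infer each column's type.
t1 <- system.time(y1 <- read.table(tmp, header = TRUE, sep = "\t"))

# With colClasses declared, the inference pass is skipped.
t2 <- system.time(y2 <- read.table(tmp, header = TRUE, sep = "\t",
                                   colClasses = c("integer", "numeric")))

# Same data either way.
stopifnot(identical(y1, y2))
```

On large files the colClasses version is typically noticeably faster; the exact ratio depends on the data.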
Example (step 4.2: test)
- Pre-slicing the data using 'cut' to produce only the variable being analyzed reduced read.table() time to a few seconds.
- But the real code differed from my benchmark:
- the real problem was reduced from >30 minutes to 5 minutes, almost all of it still in R
- summary() in R only took .1 seconds on the same data, so something else was going on in the R code
Example (step 4.3: profile)
'Rprof' showed that the mode() operation was taking an inordinate amount of time. We decided to eliminate that statistic.
(Later, R developers fixed this by using a more efficient sort algorithm.)

  % total  total seconds  % self  self seconds  name
    99.29         272.92    0.01          0.02  "univarStat"
Identify Small & Expensive Regions of Code
To double the speed of a program, the code you tune has to be > 50% of the execution time
'Optimizing' large amounts of code is ineffective and hard to maintain
If code takes only a small percentage of run time, there is risk of 'overtuning' (tuning to the specific situation in a way that does not have general benefits)
Tuning Strategies: General
Waste not, want not…
Put off until tomorrow…
Pre-processing
Efficient / approximate / randomized / optimistic algorithms
Appropriate data structures
Find the smallest chunk of code that takes the biggest share of time
Never tune without profiling
It's hard to beat the professionals
Tuning Strategies: CPU, Memory, Network
CPU
  Interpreted code is slower than compiled code
  Use compiler optimizations / select better compiler
  Compiler can take advantage of low-level memory hierarchy details
  Compiler can take advantage of instruction pipelining (overlapping instructions in the CPU)
  Correct order; correct chunk size; locality of reference; pre-fetching
Memory
  Avoid duplication
  Efficient access order
  Locality of reference – process in contiguous blocks
Network
  Correct chunk size
  Avoid frequent name resolution
  Use appropriate protocol
  Use a local address if you got it…
Tuning Tips: R
Use Rprof/Rprofmem/proftools
R is optimized for whole data objects:
  Vector/matrix operations much faster than explicit loops
  Use replacement methods to update objects
R performance tips
  As in previous slides – avoid unnecessary duplication of objects through assignments, growing objects, etc. In R, changing any object copies it
  Remove attributes from large objects if possible
  Manage intermediate results: save a result if you need it; reuse/remove if not (or avoid naming results at all)
Compiled code for performance gains
  Built-in functions based on .Internal() R functions are much faster than R code
  Use C++ for small, time-consuming operations such as inner loops
  Link with optimized libraries (such as ATLAS) for matrix algebra
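The "growing objects" tip above can be illustrated with a minimal sketch (the sizes here are arbitrary): appending to a vector copies it on every iteration, while pre-allocating writes in place.

```r
# Anti-pattern: growing a vector copies it on every c() call, O(n^2) work.
grow <- function(n) {
  v <- numeric(0)
  for (i in 1:n) v <- c(v, i)
  v
}

# Better: allocate the full length once, then fill in place.
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) v[i] <- i
  v
}

n <- 20000
t_grow <- system.time(a <- grow(n))["elapsed"]
t_pre  <- system.time(b <- prealloc(n))["elapsed"]

# Same result; the pre-allocated version is dramatically faster for large n.
stopifnot(identical(a, b))
```

(Vectorized `as.numeric(1:n)` would of course beat both; the loop form is kept only to isolate the allocation effect.)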
Example: Matrix Margins

se1<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (i in (1:dim(m)[1])) {
    for ( j in (1:dim(m)[2]) ) {
      sumvector[j]<-sumvector[j]+m[i,j]
    }
  }
  return(sumvector)
}

se2<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (j in (1:dim(m)[2])) {
    for ( i in (1:dim(m)[1]) ) {
      sumvector[j]<-sumvector[j]+m[i,j]
    }
  }
  return(sumvector)
}

se3<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (j in (1:dim(m)[2])) {
    sumvector[j]<-sum(m[,j])
  }
  return(sumvector)
}

se4<-function(m) {
  apply(m,2,sum)
}

se1a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (i in (1:dim(m)[2])) {
    for ( j in (1:dim(m)[1]) ) {
      sumvector[j]<-sumvector[j]+m[j,i]
    }
  }
  return(sumvector)
}

se2a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (j in (1:dim(m)[1])) {
    for ( i in (1:dim(m)[2]) ) {
      sumvector[j]<-sumvector[j]+m[j,i]
    }
  }
  return(sumvector)
}

se3a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (j in (1:dim(m)[1])) {
    sumvector[j]<-sum(m[j,])
  }
  return(sumvector)
}

se4a<-function(m) {
  apply(m,1,sum)
}
> mm<-matrix(runif(5000^2),nrow=5000)
> system.time(res1<-se1(mm))
   user  system elapsed
 437.48    0.36  512.83
> system.time(res2<-se2(mm))
   user  system elapsed
 412.16    0.29  464.53
> system.time(res3<-se3(mm))
   user  system elapsed
   2.09    0.24    8.77
> system.time(res4<-se4(mm))
   user  system elapsed
   2.99    0.38    3.68
> system.time(res1a<-se1a(mm))
   user  system elapsed
 427.87    0.34  478.53
> system.time(res2a<-se2a(mm))
   user  system elapsed
 436.04    0.36  473.20
> system.time(res3a<-se3a(mm))
   user  system elapsed
   2.03    0.35    2.46
> system.time(res4a<-se4a(mm))
   user  system elapsed
   3.54    0.45    4.17
> system.time(res4a<-se4a(mm))
   user  system elapsed
   3.67    0.53    4.40
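A follow-up worth noting (not on the original slide): for this particular task base R also provides colSums()/rowSums(), which run compiled C code over the whole matrix and are usually faster still than apply():

```r
# Smaller matrix than the slide's, just to compare results quickly.
mm <- matrix(runif(500^2), nrow = 500)

# apply() loops over columns calling an R function each time;
# colSums() performs the whole reduction in compiled code.
res_apply <- apply(mm, 2, sum)
res_csum  <- colSums(mm)

# Identical answers up to floating-point rounding.
stopifnot(isTRUE(all.equal(res_apply, res_csum)))
```

The same holds for rowSums() versus apply(m, 1, sum).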
Why Parallelize?
Time vs. money
Most computers are idle most of the time
Be the first on the block to have your own compute cluster
but…
Parallelization can decrease throughput
Often takes considerable effort
Difficult to ensure correct operation
Sources of Parallelism
Intra-core
  Pipelining: execute a different instruction at each pipeline stage
  Hyperthreading: very fast switching between two processes, when one is "stalled" in the pipeline
  SSE: specialized vector operations on 8 data points simultaneously
  FPU: floating point math done in a separate unit
Intra-system
  GPU: specialized simultaneous updates of large matrices
  Multi-core: shared memory, independent programs
Distributed
  Separate computers
  Connected by network
  "Shared nothing"
[Figure: block diagram of a single system – processors (each with cores, L1/L2 caches, and FPUs), memory, GPU, network card, and a RAID controller with disks]
Amdahl's 'Law'
Most code is not 100% parallelizable/tunable
Speedup is limited by the percentage of time the non-tunable part requires
Implication 1: may not be worth the effort of tuning if tunable code takes up less than 80% of current runtime
Implication 2: most programs will not benefit significantly from running on more than 4 processors
Implication 3: if communication & synchronization costs are high, the problem may take longer to solve
[Figure: Amdahl's law – speedup vs. percent parallelizable, plotted for 4 processors and for n = 32]
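Amdahl's law itself is a one-liner: with a fraction p of the runtime parallelizable across n processors, speedup = 1 / ((1 - p) + p/n). A minimal sketch reproducing the implications above:

```r
# Amdahl's law: overall speedup when fraction p runs on n processors.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

# Even with 80% of the code parallelizable, 4 processors give only 2.5x...
amdahl(0.8, 4)    # 2.5

# ...and no number of processors can exceed 1/(1-p) = 5x.
amdahl(0.8, 1e9)
```

This is why the slide suggests not bothering unless the tunable fraction exceeds ~80% of current runtime.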
Resource Contention
Contention = too many requests for the same resources
E.g.: 10 processors writing to the same block of memory, disk file, etc.
Use caching (for reads), data blocking
Most common on SMP/multi-core architectures
Synchronization
When processes run in parallel, their operations can be arbitrarily interleaved
Isolate code that cannot be interleaved in critical regions
Use synchronization to coordinate access to critical regions
A race condition results when the final answer depends on the order in which operations execute
Race conditions are almost impossible to debug by trial-and-error
Race Condition Example

ATM_Withdrawal_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    // get account balance from database
    curBalance = get_balance(accountNum)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    dispense_cash(amount)
  }
}

Meanwhile, at some other location, the identical account is used!
…
// wait for login
accountNum = waitForLogin()
// get account balance from database
curBalance = get_balance(accountNum)
// ask user for amount
amount = get_withdrawal(curBalance)
newBalance = curBalance - amount
set_balance(accountNum, newBalance)
dispense_cash(amount)
Synchronization Building Blocks
Tasks: independent execution sequences
Test-and-set: an atomic test and set makes the check safe
  // SYNCHRONIZATION
  while (testAndSet(flag));
  // Critical region
  // SYNCHRONIZATION
  flag = FALSE
Locks: test-and-set + automatic wait loop
Semaphores:
  Countdown – N processes allowed in critical region
  + Queue – when one process releases the semaphore, all get a turn
No Race Condition – Synchronized

ATM_Withdrawal_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    // SYNCHRONIZATION
    waitForLock(accountNum)
    // BEGIN CRITICAL REGION
    // get account balance from database
    curBalance = get_balance(accountNum)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    // END CRITICAL REGION
    // SYNCHRONIZATION
    releaseLock(accountNum)
    dispense_cash(amount)
  }
}
Deadlock occurs when all processes are waiting for resources held by other processes
Starvation occurs when one or more processes receives no resources over time
[Figure: Process1, Process2, Process3 in a circular waiting cycle]
Deadlock Example

Try simultaneously: A->B, B->C, C->A

ATM_transfer_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    accountNum2 = askForAccount()
    // SYNCHRONIZATION
    waitForLock(accountNum)
    waitForLock(accountNum2)
    // BEGIN CRITICAL REGION
    // get account balances from database
    curBalance = get_balance(accountNum)
    curBalance2 = get_balance(accountNum2)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    newBalance2 = curBalance2 + amount
    set_balance(accountNum, newBalance)
    set_balance(accountNum2, newBalance2)
    // SYNCHRONIZATION
    releaseLock(accountNum)
    releaseLock(accountNum2)
  }
}
Avoiding Deadlock
Timeouts
  Release locks after waiting for some set period
  Simple to implement, pragmatic
  Cons:
    Inefficient
    Programs need to get all locks at once before entering any critical regions, or release of a lock may be unsafe
    Timeout values may need tuning, and should be somewhat randomized (to avoid cycling)
Resource graph analysis
  Can be computationally expensive
  Not generally practical for distributed systems
Algorithmic design
  Most elegant and efficient solution
  Examples:
    Waiting room – allow N-1 accesses to a resource that supports 1
    Resource ordering – assign unambiguous ordering to resources, attempt to acquire locks in order
    Ticket – give out the right to attempt to acquire resources
  Requires careful analysis – best to use an existing design
Verification and Debugging
Difficult to trace parallel code – need special tools
Impossible to "debug" parallel algorithms
  Heisenbugs are the rule!
  Debugging serializes execution
Must verify models instead
Two Easy Cases
Explicit parallelization is hard! Look for:
  An existing parallel implementation
  An "embarrassingly parallel" implementation of the problem
Failing that, look for an existing parallel algorithm for the problem
Easy Parallelization in Stata
Stata/MP: Stata professionally parallelized
  Dozens of models
  Completely automatic
  Well documented: www.stata.com/statamp/report.pdf
Limitations
  Parallelization applies only to built-in features
  Most models are only partially parallelized
  Some overhead for parallelization
  Practically limited to 8 processes in most cases
  Does not help for memory issues
Somewhat Easier Parallelization in R
R offers a wide range of frameworks at widely varying levels of maturity
The most mature and well supported framework is snow, which works on top of sockets, NWS, MPI, or PVM
For single systems with multiple cores, the multicore package is straightforward to use
Some packages that use snow automatically:
  BARD: redistricting analysis
  GAMBoost: b-spline boosting
  Matching: propensity score matching
  bnlearn: Bayesian learning
  rgenoud: genetic optimization
  latentnet: latent cluster models
  … a dozen more packages (so far)
Parallelizing R code through lapply()
The *apply() functions are core operations in R
Each runs a function on every element of a list, or slice of a matrix/array
Returns a processed output list, which may be of reduced dimension
Like a FOR loop over a list, but practically guaranteed to have no dependencies between iterations
A natural construct for parallelization
[Figure: data flow through apply() – prepare input, process each element, manipulate output, process more; e.g. ((1,2,3),(4,5,6),(7,8,9)) -> apply() -> (6,15,24)]
Tips on parallelizing R simply
For embarrassingly parallel and data-replication only: condor_submit
For coarse-grained, more complex parallelization:
  Parallelize around the apply() operation
  Massively parallel, distributed: use snow
  Single system, multicore: use multicore
For fine-grained parallelization:
  Too much overhead in R from interpreted code, and from copying data structures on alteration
  Use parallelized C through R's foreign language interfaces
Seek out packages that are already parallelized by consulting the R HPC task view
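The "parallelize around apply()" tip can be sketched in a few lines. In current R the multicore functionality described in these slides lives in the bundled `parallel` package; `mclapply()` is a drop-in, fork-based replacement for `lapply()` (Unix-like systems only):

```r
library(parallel)  # bundled with R; absorbed the multicore interface

inputs <- 1:8

# Serial version: one independent task per list element.
serial <- lapply(inputs, function(i) i^2)

# Parallel drop-in: forks mc.cores workers (on Windows, mc.cores must be 1).
par <- mclapply(inputs, function(i) i^2, mc.cores = 2)

# Results come back in order, identical to the serial run.
stopifnot(identical(serial, par))
```

Because each element is processed independently, no synchronization is needed, which is exactly why *apply() is the natural unit of parallelization.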
snow (Simple Network of Workstations)
A simplified, pragmatic approach for R
Uses sockets (homegrown), PVM, or MPI
Can be used remotely or locally, but with higher overhead on local systems
Well suited to embarrassingly or coarse-grained parallelism
Parallelizes the sapply(), apply(), tapply(), rapply() operators
Not particularly fault tolerant – if hosts go down, you have problems
For the socket interface:
  Use ssh for tunneling
  Create ssh keys; use an ssh key manager to avoid passwords
Not supported by RCE (yet…)
> M<-matrix(1:100,10)
# convert to list form, by vector
> lM<-lapply(1:(dim(M)[1]),function(x)(M[,x]))
# set number of threads to number of cores
> options(cores=multicore:::detectCores(all.tests=TRUE))
# run across multiple cores
> unlist(mclapply(lM,sum))
# if not in console mode, use:
# unlist(mcapplySafer(lM, sum))
[1]  55 155 255 355 455 555 655 755 855 955
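For the snow side, a minimal local socket-cluster sketch (using the `parallel` package, into which snow's makeCluster()/parApply()/stopCluster() interface was folded; in the standalone snow package the calls are the same):

```r
library(parallel)

# Start two local worker processes over sockets (snow's "SOCK" transport).
cl <- makeCluster(2)

M <- matrix(1:100, nrow = 10)

# parApply distributes the per-column sums across the workers.
colsums <- parApply(cl, M, 2, sum)

# Always shut the workers down when done.
stopCluster(cl)

stopifnot(all(colsums == colSums(M)))
```

The same cluster object works with parLapply(), parSapply(), and friends; replacing makeCluster(2) with a list of remote hostnames distributes the work across machines.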
$ condorInteractiveSubmit -x -c 4 /usr/bin/R
R multicore tips
Use where communication costs are too high for distributed computation
Be cautious when running from a GUI (not thread safe)

4. Submit your job:
$ condor_submit test1.submit
Submitting job(s)..........
2 job(s) submitted to cluster 24.

5. Check with condor_status and condor_q.

Variations:
- RCE provides condor_submit_util and condor_watch utilities to simplify access
- use condor_submit_util -k -I to submit: this uses interactive prompts and keeps the submit file. You can then reuse/modify the submit file.
- to run one input file, use $(PROCESS) in arguments to have it do different things
- add notification parameters, or use condor_watch_util to set up an email notification
- chain condor jobs together with dependencies through Condor DAG
Explicit parallelism
  System call level
  Parallelization frameworks
Distributed computing
  Single system image
  Map-reduce
  Message passing
Problem Decomposition
Types of decomposition:
  Independent tasks
  Independent data chunks
  Independent phases / data-flow (pipelining)
Goals:
  Divide the work/resources needed for the problem
  Minimize dependencies among pieces of the problem
  For distributed work: minimize communication among pieces
Examples by directionality:
  Independent ("embarrassingly parallel"): analysis across separate datasets, parameterization studies, sensitivity analysis, MCMC chains, …
  Uni-directional dependency: preprocessing, image processing, …
  Complex dependency: parallel linear algebra, partitioned sorting, MCMC iterations, …
Example: simple data decomposition
  N tasks: split the data into N chunks, one per worker; then return results to the main task
  Basis for loop-level parallelization
  Useful for sapply() type operations
  Can be inefficient if data too small, or chunks require
Producer/consumer:
  Multiple producers put tasks/data-chunks in a queue/buffer
  Multiple consumers take tasks and process them
  Can be pipelined – consumers of one queue can be producers for another stage of processing
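The simple data decomposition above can be sketched in plain R (a minimal illustration; the chunk count and the sum task are arbitrary):

```r
x <- 1:1000
n_workers <- 4

# Split the data into n roughly equal, independent chunks...
chunks <- split(x, cut(seq_along(x), n_workers, labels = FALSE))

# ...process each chunk independently (lapply here; swap in
# mclapply/parLapply for actual parallel execution)...
partial <- lapply(chunks, sum)

# ...then combine the partial results back in the main task.
total <- Reduce(`+`, partial)

stopifnot(total == sum(x))
```

Because each chunk is processed with no reference to the others, there are no dependencies to synchronize, which is what makes this pattern "embarrassingly parallel".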
Parallel Programming Idioms
Used in the implementation of larger patterns
Flow of control:
  Single Program Multiple Data
  Fork-Join
  Master/Worker
  Map/Reduce
  Loop-Parallelism
Data:
  Shared Data
  Shared Queue
  Distributed Array
Synchronization:
  Mutual exclusion (locks, semaphores)
  Barriers
  Collective communication (broadcast, reduction)
Example: DAGMan
One DAG node
• You model the dependencies; DAGMan sequences your batch jobs
Lots of scientific and HPC libraries: maltman.hmdc.harvard.edu/numal/resources
Lots of statistical and data analysis packages: maltman.hmdc.harvard.edu/socsci.shtml
CRAN HPC Task View for R: cran.r-project.org/web/views/HighPerformanceComputing.html
Large-scale visualization:
  Mondrian – large categorical data: rosuda.org/Mondrian/
  GGobi – interactive data visualization: www.ggobi.org/
  VisIt – parallel visualization: wci.llnl.gov/codes/visit/
  ParaView – parallel visualization: www.paraview.org
Large data manipulation:
  PDL, the Perl Data Language – pdl.perl.org
  NumPy, numerical Python – numpy.scipy.org
  Embed R in PostgreSQL – www.joeconway.com/plr/
Further Reading – Computing
Hard problems
  Garey & Johnson, 1979. Computers and Intractability.
  Papadimitriou, 1993. Computational Complexity.
  Arora & Barak, 2009. Computational Complexity: A Modern Approach.
Algorithms – general
  T. Cormen et al., 2001. Introduction to Algorithms (3rd ed.).
  S. Skiena, 1998. The Algorithm Design Manual.
Performance optimization
  Musumeci & Loukides, 2002. System Performance Tuning. O'Reilly.
  D. Shasha & P. Bonnet, 2003. Database Tuning. Morgan Kaufmann.
  Faroult & Robson, 2006. The Art of SQL. O'Reilly.
  Menascé et al., 2004. Performance by Design.
Parallelization
  M. Ben-Ari. Principles of Concurrent and Distributed Programming. Springer.
  Miller & Boxer, 2005. Algorithms Sequential and Parallel.
  Mattson et al., 2004. Patterns for Parallel Programming.
Further Reading – Statistical Computation
S. Goedecker & A. Hoisie, 2001. Performance Optimization of Numerically Intensive Codes. SIAM Press.
J. Nocedal & S.J. Wright, 1999. Numerical Optimization. Springer.
A. Unwin, M. Theus, H. Hofmann, 2006. Graphics of Large Datasets. Springer.
G. Givens & J. Hoeting, 2005. Computational Statistics. Wiley.
Kontoghiorghes, 2006. Handbook of Parallel Computing and Statistics.
Gentle et al., 2004. Handbook of Computational Statistics.
Gentle, 2009. Computational Statistics.
Higham, 2005. Accuracy and Stability of Numerical Algorithms.
Aggarwal, 2007. Data Streams: Models and Algorithms.
Review: Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat analysis after making any changes