High Performance Statistical Computing with Applications in the Social Sciences
Micah Altman, Senior Research Scientist
"Introduction to the RCE" by Earl Robert Kinney, Manager, Research Computing Environment
Institute for Quantitative Social Science, Harvard University
Goals for Today
Analysis
- Describe performance goals
- Identify resource use patterns
- Identify resource bottlenecks
- Identify performance hot-spots
- Select problem decomposition
Application
- Connect to the RCE
- Use the RCE to analyze larger data sets
- Use the RCE to run interactive analyses more quickly
- Use the RCE to run large numbers of analyses independently
M. Altman & B. Kinney, High Perf. Stat. Computing (v. 9/10/11)
[Source: Wikimedia Commons]
Organization of this Workshop
- Motivation
- Principles
- Introduction to RCE
- Measuring Resource Use
- Scaling Up
- Tuning Up
- Scaling Out (Parallelization)
- Additional Resources
Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, and algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to the system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat the analysis after making any changes
FREE! With every first class!
Coffee! Chocolate!! Consulting!!! Time off for good behavior!!!!
And now, a word from our sponsor…
IQSS (and affiliates) offer you support across all stages of your quantitative research:
- Research design, including: design of surveys, selection of statistical methods
- Primary and secondary data collection, including: the collection of geospatial and survey data
- Data management, including: storage, cataloging, permanent archiving, and distribution
- Data analysis, including: survey consulting, statistical software training, GIS consulting, high performance research computing
Types of Performance Goals
- Task completion time – wait time to finish
- Efficiency – resource use for the task
- Throughput – work done by the system overall
- Latency – delay before response
- Responsiveness – perception of response
- Reliability – probability that the task/system will fail during a time interval
"If you don't know where you're going, any road will take you." – Proverb
"If you come to a fork in the road, take it." – Yogi Berra
High Performance Statistical Computing: Principles
Performance Goals – Rules of Thumb
- Completion time: work(i)/resource(i)
- Throughput: maximize(work/resource) for all jobs
- Latency: time elapsed before first response to input
- Real-time: complete the task within a fixed interval
"Responsiveness"
- Perceived latency
- Task completion time
- Task progress indicators
Rules of thumb:
- Users of interactive software want responsiveness
- Users of batch jobs want small completion times
- Systems administrators want maximum throughput and reliability
Size of Factors Affecting Performance
Run Time for a Large Instance (n=1000)
If the runtime for solving a small instance of a problem (n=10) on a single system is one minute, how long will it take to solve a larger instance with n=1000?

NP-Hard (worst case): 10^292 years
Very inefficient algorithm, O(N^3): 1.6 years
Inefficient algorithm, O(N^2): 16 hours
Very poor memory access patterns: 11 hours
Un-optimized code: 67 minutes
Optimized code: 7 minutes
Local multiprocessing: 2 minutes
Fully parallel / full cluster: 4 seconds
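The extrapolation behind the growth-rate rows is simple arithmetic; a minimal Python sketch (the helper name `projected_minutes` is ours, and the slide's figures also fold in constant-factor effects such as code optimization, so exact agreement isn't expected):

```python
def projected_minutes(base_minutes, n_small, n_large, exponent):
    """Extrapolate runtime, assuming cost grows as n**exponent."""
    return base_minutes * (n_large / n_small) ** exponent

# 1 minute at n=10; the problem grows 100x:
quadratic = projected_minutes(1, 10, 1000, 2)  # O(N^2): 10,000 minutes (~7 days)
cubic = projected_minutes(1, 10, 1000, 3)      # O(N^3): 1,000,000 minutes (~1.9 years)
print(quadratic, cubic)
```

The point is that the exponent dominates everything else: every extra power of N multiplies the projected time by another factor of 100 here.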
Problem Complexity Classes
- Problem complexity class: the set of problems that can be solved in O(f(n)) for some f
- More general than algorithmic complexity – encompasses all possible algorithms for solving the given problem
- A polynomial-time algorithm is necessary for large problem instances
[Diagram: decision problems split into decidable and undecidable; nested complexity classes EXPSPACE ⊇ EXPTIME ⊇ PSPACE ⊇ NP, co-NP ⊇ P = BPP(?); BQP; NP-complete]
Some Problems Are HARD
- Traveling Salesperson Problem (weighted Hamiltonian cycle): plot a route through N locations, visiting each once, that minimizes cost
- NP-Hard: worst-case instances require exponential time for an optimal, certain solution
- NP-Complete: equivalent to a large class of hard problems
Source: Applegate, Bixby, Chvátal, and Cook (1998)
How to "Solve" the Unsolvable
- Think small: use only a small number of cities; aggregate to regions and treat them as quasi-cities
- Restrict the problem: Euclidean distances are easier than travel cost
- Solve a different problem: minimum spanning tree
- Approximate the solution: for Euclidean distances, there is an algorithm based on the minimum spanning tree that is at most 50% longer
- Randomize: can a randomized algorithm find a solution with probability p? (No one knows… probably not)
- Be lucky: maybe the "average" problem isn't that hard?
- Heuristics: apply simulated annealing (etc.), cross fingers
How to Recognize Hard Problems
- Is the problem routinely solved by existing systems?
- Are efficient algorithms known?
- Does it appear in lists of hard problems?
- Is the problem universal? (Any computing problem, sufficiently generalized, is hard [Papadimitriou 1994])
Items in a fixed range, no duplicates:
    inlist = logical(length = max - min + 1)
    for (i in 1:length(items)) { inlist[items[i]] = TRUE }
    for (i in min:max) { if (inlist[i]) dowork(i) }
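The range-restricted membership idea above translates directly to Python; a sketch with a hypothetical `dowork` callback (the `- lo` offset handles ranges that don't start at 1):

```python
def each_present(items, lo, hi, dowork):
    """O(n + range): mark which values in [lo, hi] occur, then
    visit each present value, in sorted order, via dowork."""
    inlist = [False] * (hi - lo + 1)
    for v in items:
        inlist[v - lo] = True           # constant-time "insert"
    for v in range(lo, hi + 1):
        if inlist[v - lo]:
            dowork(v)

seen = []
each_present([7, 3, 5], 1, 10, seen.append)
print(seen)  # [3, 5, 7]
```

Note the side effect: the items come back in sorted order without any comparison sort, at the cost of one boolean per possible value in the range.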
How fast is this?
Algorithm Recurse_sort(array L, i = 0, j = length(L)-1)
    if L[j] < L[i] then L[i] ↔ L[j]
    if j - i > 1 then
        t = (j - i + 1)/3
        Recurse_sort(L, i,   j-t)
        Recurse_sort(L, i+t, j)
        Recurse_sort(L, i,   j-t)
    return L
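Answering the slide's question: this is stooge sort. The recurrence T(n) = 3·T(2n/3) + O(1) solves to roughly O(n^2.71) — asymptotically worse than even a naive O(n^2) sort. A runnable Python transcription:

```python
def recurse_sort(L, i=0, j=None):
    """Stooge sort: sort L[i..j] in place via three overlapping 2/3-size recursions."""
    if j is None:
        j = len(L) - 1
    if L[j] < L[i]:
        L[i], L[j] = L[j], L[i]
    if j - i > 1:
        t = (j - i + 1) // 3
        recurse_sort(L, i, j - t)       # first two thirds
        recurse_sort(L, i + t, j)       # last two thirds
        recurse_sort(L, i, j - t)       # first two thirds again
    return L

print(recurse_sort([5, 2, 9, 1, 3]))  # [1, 2, 3, 5, 9]
```

It does sort correctly — which is the slide's broader point: correct code can still have terrible resource complexity.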
Answers Matter
- Before optimization, verify the answer
- Right can mean "right enough," if well-defined
- Correct code may have different performance characteristics than incorrect code
- Returning a wrong answer can always be done
Big, Fast, Cheap – Pick 2
- Latency increases with each step down the hierarchy
- Storage increases
- Throughput decreases (except with some offline storage)

Registers (<1 KB)
Cache (1 MB)
RAM (10s of gigabytes)
Local storage (10s of terabytes)
Online storage (100s of petabytes)
Offline storage (10s of exabytes)
Reading One Byte: x -= m[1,3]
- CPU: 8 bytes loaded into a register
- Cache: 256-byte cache line
- RAM: 4 KB page
- Disk: 8 KB block from NFS (networked file system)
General Performance Implications of Architecture
- Talking to external devices can cause waits (latency)
- Information transmitted to the CPU is limited by the bus (throughput)
- In practice, expect 80% of theoretical data-path bandwidth at best
- Some optimizations are highly specific to architectural details
- Hidden parallelism at low levels
- Information travels in chunks (at least bus size)
- Complexity makes theoretical performance analysis difficult – use benchmarks
From Principles to Practice
Practice = Principles * Optimization Goals * Problem Type * Computing Environment
Optimization goals:
- Throughput
- Latency
- Reliability
- Scaling up
- Scaling out
Problem decomposition:
- Independent data
- Independent calculations
- Coupled calculations
Principled Preparation Checklist
- Verify that your problem is tractable: substitute an easier problem; restrict or limit the problem; be lucky or clever
- Establish performance goals
- Identify possible algorithms: What is their resource complexity? Are better algorithms known?
- Identify potential system characteristics: communications costs; systems resources
High Performance Statistical Computing: Introduction to RCE

- Full virtual desktop environment – connect anywhere
- Many research software packages available
- Persistent session – connect anytime

Interactive Nodes
- For large interactive jobs
- Large amounts of memory available on demand
- Stata, Matlab, Mathematica

Batch Processing
- Run hundreds of jobs at once
- Optimized for non-interactive, independent work
- Easy to run from your virtual desktop
Why use the RCE?
For research:
- An environment customized for quantitative social science research
- A wide variety of research software packages are available
For convenience:
- The RCE enables you to access a research desktop from almost any computer
- Sessions are persistent – disconnect from your office, reconnect from home
- File storage is central; never worry about which computer has your files
For resources:
- Large analysis jobs are offloaded to high-powered servers
- Large resource pools: 800 processors, 3.3 TB of memory, 40 TB of disk storage
- Regularly updated software
For collaboration:
- Offers an ideal environment for collaborative research projects
- Share project files, desktops, software
For reliability:
- System performance and availability are constantly monitored
- Research files are regularly backed up and stored securely
- IQSS has full-time staff dedicated to supporting the RCE
Systems Resource Use
- Benchmarks
- Timing
- System resource monitoring
- System resource limits
High Performance Statistical Computing: Analyzing Resource Use
Benchmarks
- What patterns of usage are likely to occur? What are the 80% cases?
- Are there 10% cases that have unusual patterns of data access, or unusual input?
- Can you construct a plausible worst case?
- Parameterize benchmarks: parameterize problem size; vary order of magnitude
- Create benchmarks based on real cases: use real problems for full benchmarking; miniaturize real problems for quick tests
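A minimal parameterized benchmark harness in Python (the names `benchmark` and `sum_squares` are ours; best-of-reps timing damps noise from other activity on the system):

```python
import time

def benchmark(task, sizes, reps=3):
    """Time task(n) at each problem size; keep the best of `reps` runs
    to damp noise from other activity on the system."""
    results = {}
    for n in sizes:
        best = float("inf")
        for _ in range(reps):
            start = time.perf_counter()
            task(n)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results

def sum_squares(n):                      # miniature stand-in for a real analysis
    return sum(i * i for i in range(n))

# Vary size by orders of magnitude to expose the scaling curve
for n, t in benchmark(sum_squares, [10**3, 10**4, 10**5]).items():
    print(f"n={n:>7}  best of 3: {t:.6f}s")
```

Plotting time against size on a log-log scale then makes the growth exponent visible directly.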
Common Benchmarks
- Artificial benchmarks
- Simple "unit" benchmarks
- Real application + random data
- Real application + real data
- Real application + worst-case data
- Mix of applications
Timing
Why measure timings:
- Direct or indirect measure of performance
- Establish a baseline for changes
- Empirical measure of scaling
Limitations:
- Timers are often imprecise for brief events
- Other activity on the system adds "noise"
- Many tools aggregate all phases of execution
- Many tools aggregate all areas of resource use
- CPU timings may exclude system resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Alternative: Queuing Models
- A formalist alternative to benchmarks; can be useful for capacity planning
- Model services as a network of queues: different classes of "customers"; resources with different delay characteristics; transition probabilities
- Distribution of "service events": Poisson events are discrete, independent, and memoryless; the number of events is Poisson distributed; interarrival times are exponentially distributed
- Little's law: length of queue = arrival rate * time in queue
- Limitations: heroic assumptions are often required; state-space explosion; only the simplest models are solvable in closed form
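Little's law can be checked exactly on any finite trace in which every job departs: the time-average number in system obtained by integrating N(t) must equal (n/T)·W. A sketch for a FIFO single-server queue (function and variable names are ours):

```python
def little_check(arrivals, services):
    """Run a FIFO single-server queue, then compute the time-average number
    in system (L) two ways: by event sweep, and as arrival_rate * W."""
    finish, departures = 0.0, []
    for a, s in zip(arrivals, services):
        start = max(a, finish)              # wait if the server is busy
        finish = start + s
        departures.append(finish)
    T = departures[-1]                      # every job has departed by T
    lam = len(arrivals) / T                 # average arrival rate
    W = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)
    events = sorted([(a, 1) for a in arrivals] + [(d, -1) for d in departures])
    area = last_t = 0.0
    in_system = 0
    for t, delta in events:
        area += in_system * (t - last_t)    # integrate N(t) piecewise
        in_system += delta
        last_t = t
    return area / T, lam * W                # the two must agree

L, lamW = little_check([0, 1, 2, 3], [2, 2, 2, 2])
print(L, lamW)  # 1.75 1.75
```

The agreement is an identity over a completed horizon, which is why Little's law needs so few assumptions compared to the rest of queueing theory.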
Source: Takefusa, et al. 1999
Wall-Clock Time
- Measure completion time
- Show phases of execution by inserting calls
*Optional tool, may require installation on your system
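Inserting timing calls between phases is the simplest instrumentation. A Python sketch (the helper name `report_phase` is ours; the phase bodies are stand-ins):

```python
import time

def report_phase(label, start):
    """Print elapsed wall-clock time for one phase, return a new start mark."""
    now = time.perf_counter()
    print(f"{label}: {now - start:.3f}s")
    return now

t = time.perf_counter()
data = list(range(1_000_000))            # stand-in for a data-loading phase
t = report_phase("load", t)
total = sum(x * x for x in data)         # stand-in for an analysis phase
t = report_phase("analyze", t)
```

The same pattern appears later in these slides as repeated `date()` calls in an R session.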
Interpreting Process Monitor Results
- Shows the list of processes running
- See current and accumulated CPU & memory usage
- See CPU utilization
$ gnome-system-monitor &
[Screenshot callouts: show processes; sort by number of processes waiting to use the CPU; sort processes by CPU use]
Sample Performance Curves
- Best case: linear in the size of the problem
- Nonlinearities could mean: an inefficient algorithm (case 2); a hard problem (case 3); poor data access patterns (case 4)
System Resource Monitoring
Why monitor system resources?
- Identify bottlenecks
- Identify processes using resources – may affect overall throughput and capacity
- Identify processes actively using resources – may affect performance
Limitations:
- Tools are often imprecise for brief events
- Other activity on the system adds "noise"
- Many tools aggregate all phases of execution
- Many tools aggregate all system use
- Many tools aggregate sub-resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Monitoring System Resources
- See system aggregated use and activity for memory, disk, network
- See memory use by process
- See resource use by process (varies by platform)
Linux: gnome-system-monitor; /usr/bin/time -v; sar; iostat; vmstat
OS X: Utilities -> "Activity Monitor"; /usr/bin/time -v; sar; iostat
Windows: perfmon.exe; taskmgr.exe

$ gnome-system-monitor &
$ sar -A 1 10
$ /usr/bin/time -v stata -b somefile.do
Detailed System Resource Tracing
- See system use/calls for a process as it runs
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity
$ sar -bB 1 10
System memory activity
System disk activity

Interpreting System Activity
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity
$ perfmon
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 149
Voluntary context switches: 1280
Involuntary context switches: 460
Swaps: 0
File system inputs: 0
File system outputs: 0

Interpreting Process Resource Use
- Page faults – indicate memory activity or resource contention
- Voluntary context switches – indicate waiting on I/O or memory
- Swaps – indicate a severe system memory shortage
- File I/O – indicates file activity
- If a number is always 0, it's a lie
$ /usr/bin/time -v stata -b command
[Screenshot callouts: often memory related; process disk I/O]
Symptoms of a CPU-Bound System/Problem
- CPU user+sys activity near 100% while there are active processes (if # of processes > # of CPUs)
- The performance curve for your problem is continuous
This is usually good:
- CPU is the most expensive resource
- You can trust code profiling reports
- More likely to see gains from parallelization
However, if CPU %sys is high, suspect inefficient use of system calls, or borderline I/O or memory bottlenecks
Symptoms of a Resource Bottleneck
Memory bottlenecks:
- Severe: processes in the swap queue (or waiting on swap); lots of swap space in use (see swap -m); swapping activity; free memory low
- Moderate: high context switches + high page (validity) faults + active processes with memory >> resident memory
I/O bottlenecks:
- Moderate: high %sys activity in CPU; high number of system calls and interrupts
- Severe: I/O rate high; context switches, wait on I/O, or processes sleeping on I/O; physical disk activity high
Performance curve:
- Discontinuous regions of accelerated performance decline
Tune Against Bottlenecks
- Typically, a single resource will be the bottleneck point: CPU; memory; I/O (graphics, network, disk)
- If you don't address the bottleneck, optimizations elsewhere won't matter
- Bottlenecks may depend on the usage scenario and phase of operation
- Fixing one bottleneck may reveal others
- Don't expect speedup of the entire program to be proportional to the code you just tuned!
- Programs interact; try to profile on a quiet system first
Resource Analysis: Checklist
- Identify benchmarks: small instances of your problem; can vary size
- Target an isolated system; minimize other activity
- Time benchmarks at various sizes
- Monitor systems resources
- Look for non-linearities in the performance curve
- Look for bottlenecks
Lab: Analyzing Resource Use
In this lab, we will log in to the RCE and run a simple set of benchmarks.
Scaling Up
- Addressing resource bottlenecks
- System and application limits
- Storing/accessing large datasets
- Visualizing large datasets
When to Scale Up
- If resource analysis identifies a memory bottleneck
- If resource analysis identifies an I/O bottleneck (maybe…)
- If problem size prevents the program from starting
- If the program crashes or hangs in the middle of solving large problems (maybe…)
- If planning ahead for significant usage changes: size of problem data > ~1/2 of available physical memory (RAM); change of algorithm; change of data structure
Addressing Memory Bottlenecks
Review – symptoms of a memory bottleneck:
- Discontinuity in the performance curve
- Memory size of the process increasing
- Resident memory size of the process relatively large
- System activity shows memory activity
Principles of addressing memory bottlenecks:
- Memory hierarchy
- Locality of reference
Programming patterns:
- Add more resources
- Modify data types
- Modify data structures
- Modify algorithms
Memory Hierarchy
- Registers (<1 KB)
- Cache (1 MB)
- RAM (10s of gigabytes)
- Local storage (10s of terabytes)
- Online storage (100s of petabytes)
- Offline storage (10s of exabytes)
If a register access took a second, a tape access would take a few centuries.
Buy one, get 8092 free!
Locality of Reference
- Temporal locality: reuse the same data elements
- Spatial locality: use elements that are "near" each other in memory
What is "near"?
- For vectors and files: sequential ordering
- For matrices: either row or column ordering, depending on the language
- For complex data structures: use experimentation and analysis
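The row-versus-column effect can be demonstrated in any language; a Python sketch traversing the same nested-list "matrix" row-by-row versus column-by-column (row-order matches how each row's elements are laid out, so it touches memory sequentially). In pure Python the gap is modest; with large NumPy arrays or compiled code, poor stride order can cost an order of magnitude:

```python
import time

n = 1000
matrix = [[i * n + j for j in range(n)] for i in range(n)]  # rows are contiguous lists

start = time.perf_counter()
row_sum = sum(matrix[i][j] for i in range(n) for j in range(n))   # sequential access
row_time = time.perf_counter() - start

start = time.perf_counter()
col_sum = sum(matrix[i][j] for j in range(n) for i in range(n))   # stride-n access
col_time = time.perf_counter() - start

print(row_sum == col_sum, f"row: {row_time:.3f}s  col: {col_time:.3f}s")
```

Both traversals compute the same answer; only the access pattern — and therefore the cache behavior — differs.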
Row-Major Order
Adding More Resources – "$$$" Optimization
- Buy more memory, or… use the RCE to request a larger share
- This is effective if the local set size < share size
System and Application Resource Limits
Limits imposed by the system or application:
Virtual memory:
- Logical memory space for a process
- Virtual memory limits the maximum size of memory requested
- Can prevent a program from starting, or from loading large data
Physical memory:
- Physical RAM installed in the system
- Usually smaller than VM, but not always
- Maximum efficient local set
Resident size limits:
- Affect the maximum efficient local set, though not as severely as physical limits
Limits in Linux and OS X
Where limits are set:
- Set at bootup
- Set by the system at login – group/user-level total memory limits
- Set in the shell at process creation – request a new limit (up to the user maximum)
- Set in code via setrlimit
- Set in the application
Know your limits:
- Linux/OS X: ulimit -a
- R: none for Linux
- Stata: query memory
Limits on 32- vs. 64-bit systems:
- A 32-bit OS has a limit of 4 GB for virtual & physical memory
- 64-bit OS: no practical limit on virtual memory; physical memory still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
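On Linux/OS X the same limits can also be queried and adjusted programmatically; a sketch using Python's stdlib resource module (Unix only):

```python
import resource

# Address-space (virtual memory) limit for this process: (soft, hard)
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("virtual memory limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes")

# A process may lower its own soft limit (up to the hard limit), e.g.:
# resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))
```

This mirrors the setrlimit bullet above: the soft limit is what actually stops an allocation, and only the hard limit requires privileges to raise.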
Limits in Windows Systems
Where limits are set:
- Limits implied by configuration at boot
- Virtual memory typically depends on configured paging space on disk + pagefile
- R: memory.limit()
Limits on 32- vs. 64-bit systems:
- Most 32-bit Windows OSes have a limit of 3 GB of physical memory: 32-bit addressing allows 4 GB, but 1 GB is reserved for memory-mapped hardware, so only 3 GB is left over in most Windows configurations
- 64-bit OS: no practical limit on virtual memory (8 TB); physical memory still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
- Some Windows applications are 32-bit versions, so they are still limited to 4 GB of virtual memory
Basic Memory Management in Statistical Software

                            Matlab   R                                              Stata
Memory limit                ---      memory.size() [Windows only]                   set memory
Remove objects              clear    rm()                                           clear
Shrink data types           ---      as.integer(real_val), as.factor(string_val)    compress
Measure data size           ---      object.size(), gc()                            memory
Order for virtual memory    pack     gc()                                           set virtual
Data Structures and Memory
- In some circumstances, few data points * lots of dimensions = huge matrices
- Alternate representations can help: sparse matrix; network/graph; linked list; hash
- Matlab and R offer these; Stata doesn't offer much control over this
Algorithmic Approaches for Memory Bottlenecks
Software default memory models:
- R: in-memory by default; can work around this through DBI interfaces and by chunking data
- SAS/SPSS: cross-tabs, descriptive stats, and correlations can be done without large memory; GLM etc. is done in memory
- Numerical Python / Perl Data Language / LAPACK libraries: for large matrix manipulation
- MySQL/Postgres: most analyses are done streaming, with a small memory footprint
Use memory-efficient options for MLE and optimization (e.g., the conjugate gradient method for optim(); avoid Hessians)
Use memory-friendly techniques:
- Blocking
- Divide and conquer
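Data blocking in miniature: a Python sketch (the names `chunks` and `blocked_mean` are ours) that computes a grand mean from fixed-size blocks without ever holding the full dataset in memory:

```python
def chunks(stream, size):
    """Yield successive blocks of at most `size` items from any iterable."""
    block = []
    for x in stream:
        block.append(x)
        if len(block) == size:
            yield block
            block = []
    if block:
        yield block

def blocked_mean(stream, size=1000):
    """One pass, O(size) memory: accumulate count and sum block by block."""
    n = total = 0
    for block in chunks(stream, size):
        n += len(block)
        total += sum(block)
    return total / n

print(blocked_mean(iter(range(1, 10_001)), 256))  # 5000.5
```

The same skeleton works for any statistic that decomposes over blocks (counts, sums, cross-products); the one-pass variance algorithms on the next slide are the classic instance.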
Example: Computing Variance
Source: Wikipedia
Two-pass algorithm:

    def two_pass_variance(data):
        n = 0
        sum1 = 0
        for x in data:
            n = n + 1
            sum1 = sum1 + x
        mean = sum1 / n
        sum2 = 0
        for x in data:
            sum2 = sum2 + (x - mean)**2
        variance = sum2 / (n - 1)
        return variance

One-pass, numerically unstable:

    def naive_variance(data):
        n = 0
        Sum = 0
        Sum_sqr = 0
        for x in data:
            n = n + 1
            Sum = Sum + x
            Sum_sqr = Sum_sqr + x*x
        mean = Sum / n
        variance = (Sum_sqr - Sum*mean) / (n - 1)
        return variance

One-pass, accurate (Welford's method):

    def online_variance(data):
        n = 0
        mean = 0
        M2 = 0
        for x in data:
            n = n + 1
            delta = x - mean
            mean = mean + delta/n
            M2 = M2 + delta*(x - mean)
        variance_n = M2/n        # population variance
        variance = M2/(n - 1)    # sample variance
        return variance
Storing Large Data: Compression
Advantages:
- Reduces size on disk
- Can decrease load times (disk is slow; the CPU can keep up with decompression)
Disadvantages:
- Generally does not change size in working memory (compressed data structures are rare)
- Less efficient for data that will be updated (need to write a separate copy of the file)
Tools:
- Separate stage: gzip/zip
- Pipes avoid creating separate copies of extracted files: gzip/gunzip -c
- Access compressed data directly in R: gzfile()
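The same streaming idea is available in Python's stdlib: write a gzip file once, then read it back line-by-line, decompressing on the fly, without materializing an uncompressed copy on disk (analogous to `gunzip -c | …`, or `gzfile()` in R). The file name and contents here are invented for the sketch:

```python
import gzip, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt.gz")

# Write compressed text directly -- an uncompressed copy never hits the disk
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"{i}\t{i * i}\n")

# Stream it back, decompressing on the fly
total = 0
with gzip.open(path, "rt") as f:
    for line in f:
        total += int(line.split("\t")[1])

print(total)  # 332833500, the sum of squares 0..999
```

Because decompression happens inside the read loop, peak memory stays at one line regardless of the file's uncompressed size.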
Storing Large Data: Data Formats
Making the most of formats:
- Use storage features: binary storage, internal compression
- Choose efficient data types: integers, numeric, factor; avoid character and custom types
- Partition sets of variables into different data frames by hand
Key-value based data storage:
- Details: B-tree based or hash file (e.g., gdbm, Berkeley DB); key-value pairs used for storage/quick retrieval; most efficient for multiple readers of arbitrary values; less efficient for frequent updates and continuous reads
- Tools: see the R task view
Scientific data formats:
- Details: common formats are HDF and NetCDF; built for complex hierarchical data structures; efficient storage/updates of entire variables from within datasets (uses B-trees)
- Tools: require external support libraries; R packages for ncdf/cdf/hdf
Large Data: Analysis
Virtual memory:
- Configure the system with lots of virtual memory – swap space (not currently an RCE option)
- Use R/Stata settings for better VM management; locality of reference is critical
Subsampling:
- Good for means and moments
- May miss outliers; stratify on outliers if possible
Data reduction with external tools:
- Partition data horizontally/vertically
- Reduce measurements (e.g., code text)
- Convert data structures
Data streaming / data blocking:
- Can the algorithm compute in one pass?
- Can the algorithm compute in blocks of arbitrary size?
Storing Large Data: Databases
Systematic storage:
- Studies organized into databases
- A database partitioned into multiple tables
- Tables contain sets of variables
- Data from multiple tables join'ed on
Layout of data in tables:
- Appropriate normalization
SQL statements:
- SQL is declarative, not procedural
- Stick to the relational model as much as possible (e.g., no functions or ordering in selection rules – only in final results)
- Put the database query optimizer to work for you
Database structures affecting performance:
- Indexes: optimize ad-hoc row-wise retrieval
- Cursors: optimize retrieving a subset of data incrementally
- Materialized views: a predefined subset of data for retrieval
Performance tuning:
- Data types: native types are fastest; variable-length character fields and large objects can affect performance
- Reads vs. updates: reads do not require exclusive locks (generally); writes require exclusive locks – if there are lots of writes, use row-level locks; indexes help reads, hurt writes
- Some summary statistics can be computed in SQL, regardless of main-memory limitations in R
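A concrete instance of that last point, using Python's stdlib sqlite3 (the table and column names are invented for the sketch): the database streams over rows to compute the aggregates, so the client never needs the full table in memory.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (respondent INTEGER, income REAL)")
con.executemany("INSERT INTO survey VALUES (?, ?)",
                [(i, 1000.0 * i) for i in range(1, 101)])

# Aggregation happens inside the database engine, one row at a time
row = con.execute(
    "SELECT COUNT(*), AVG(income), MIN(income), MAX(income) FROM survey"
).fetchone()
print(row)  # (100, 50500.0, 1000.0, 100000.0)
```

The same SELECT works unchanged against an on-disk table far larger than RAM.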
Optimizations for Databases: Gotchas
- Does not save disk space (often more space is needed for indices)
- Difficult to index ad-hoc retrieval of variables – efficiency depends on a fixed table decomposition
- Limited data types & numeric precision
- Can be much slower than storing in R (etc.) for data that would fit in main memory
- If doing analysis in some other program, you still need to manage the size of extracted subsets to avoid memory limitations
- Overhead for access and indexing: indexing always slows down updates to the database
Large Data Analysis: Graphics
Types of graphical issues – areal plots:
- Display is independent of the number of observations
- Use logical zooming for large numbers of categories
- Use redlining
- Examples: histograms, mosaic plots, fluctuation diagrams
Scaling Up: Checklist
- Perform resource usage analysis first
- Pay attention to locality of reference and the memory hierarchy
- Examine resource limits
- Consider adding memory resources
- Consider data types and program memory management
- Consider data structures
- Modify algorithms: blocking, divide and conquer
- Consider a database / data-streaming algorithms
- Use visualization techniques suitable for large data
High Performance Statistical Computing: Tuning Up
- Get the answer right (or as right as you need), first
- Use appropriate algorithms
- Identify use-case-based benchmarks
- Look for system bottlenecks
- Examine processes
- Profile applications
- Tune small, expensive regions of code
Profiling Methodologies – Wall Clock
- Usually instrumented ad hoc
- Very easy to do for very small programs
- Very coarse grained
- Selects on the dependent variable
- Stata: timer; R: trace(glm,

Profiling Methodologies – Statistical (Sampling) Profilers
- Automatically inserts an interrupt handler into the code
- A timer fires interrupts regularly
- The interrupt handler builds up a statistical picture of runtime behavior
- Uses the "call graph" to determine where the program is spending time
- Useful for identifying CPU-intensive functions
Limitations:
- Does not capture bottlenecks that cause a process to sleep/block/swap (i.e., waiting for a file open or network read to complete, or swapping out large chunks of memory)
- Somewhat imprecise – needs longer runtimes (10s of seconds should be fine)
- Needs representative input data
- May be limited in reporting details of the calling sequence
- Usually limited to function calls; won't show details on blocks of code
- Heisenbugs
User time (seconds): 357.04
System time (seconds): 3.51
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:23.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 907
Minor (reclaiming a frame) page faults: 229112
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Example (step 4: single process timing)
- Time –v output shows process is CPU bound. -No discrepancy between wall-clock and CPU time.
I used ‘time –v R …’ to look at a smaller example:
Example (step 4.1: mini-profile)
I cooked a short R test program, using a subset of the original data, timing calls, and a simpler stat analysis:

% /usr/bin/time -v R --no-save < R.example
> date()
[1] "Wed Dec 26 14:47:22 2001"
> y<-read.table("da6",header=TRUE,row.names=,sep="\t")
> date()
[1] "Wed Dec 26 14:53:44 2001"
> summary(y$CLIDNO)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     11    5807    9999    8119    9999    9999
> date()
[1] "Wed Dec 26 14:53:44 2001"
> q()
The internal timing calls alone show that 99%+ of the time was spent in read.table()
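A common complementary fix for a read.table() bottleneck (not shown on the slide; the file, column names, and types here are hypothetical) is to declare colClasses up front so R can skip its type-guessing pass over the data:

```r
# Sketch: pre-declaring column classes speeds up read.table().
# Build a small tab-delimited test file to read back.
tmp <- tempfile()
write.table(data.frame(CLIDNO = 1:100000, X = runif(100000)),
            tmp, sep = "\t", row.names = FALSE)

# Default: R inspects the data to infer each column's type.
t1 <- system.time(y1 <- read.table(tmp, header = TRUE, sep = "\t"))

# With colClasses declared, the inference pass is skipped.
t2 <- system.time(y2 <- read.table(tmp, header = TRUE, sep = "\t",
                                   colClasses = c("integer", "numeric")))

# Same data either way.
stopifnot(identical(y1, y2))
```

On large files the colClasses version is typically noticeably faster; the exact ratio depends on the data.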
Example (step 4.2: test)
- Pre-slicing the data using 'cut' to produce only the variable being analyzed reduced read.table() time to a few seconds.
- But the real code differed from my benchmark:
- the real problem was reduced from >30 minutes to 5 minutes, almost all of it still in R
- summary() in R only took .1 seconds on the same data, so something else was going on in the R code
Example (step 4.3: profile)
'Rprof' showed that the mode() operation was taking an inordinate amount of time. We decided to eliminate that statistic.
(Later, R developers fixed this by using a more efficient sort algorithm.)

  % total  total seconds  % self  self seconds  name
    99.29         272.92    0.01          0.02  "univarStat"
Identify Small & Expensive Regions of Code
To double the speed of a program, the code you tune has to be > 50% of the execution time
'Optimizing' large amounts of code is ineffective and hard to maintain
If code takes only a small percentage of run time, there is risk of 'overtuning' (tuning to the specific situation in a way that does not have general benefits)
Tuning Strategies: General
Waste not, want not…
Put off until tomorrow…
Pre-processing
Efficient / approximate / randomized / optimistic algorithms
Appropriate data structures
Find the smallest chunk of code that takes the biggest share of time
Never tune without profiling
It's hard to beat the professionals
Tuning Strategies: CPU, Memory, Network
CPU
  Interpreted code is slower than compiled code
  Use compiler optimizations / select better compiler
  Compiler can take advantage of low-level memory hierarchy details
  Compiler can take advantage of instruction pipelining (overlapping instructions in the CPU)
  Correct order; correct chunk size; locality of reference; pre-fetching
Memory
  Avoid duplication
  Efficient access order
  Locality of reference – process in contiguous blocks
Network
  Correct chunk size
  Avoid frequent name resolution
  Use appropriate protocol
  Use a local address if you got it…
Tuning Tips: R
Use Rprof/Rprofmem/proftools
R is optimized for whole data objects:
  Vector/matrix operations much faster than explicit loops
  Use replacement methods to update objects
R performance tips
  As in previous slides – avoid unnecessary duplication of objects through assignments, growing objects, etc. In R, changing any object copies it
  Remove attributes from large objects if possible
  Manage intermediate results: save a result if you need it; reuse/remove if not (or avoid naming results at all)
Compiled code for performance gains
  Built-in functions based on .Internal() R functions are much faster than R code
  Use C++ for small, time-consuming operations such as inner loops
  Link with optimized libraries (such as ATLAS) for matrix algebra
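The "growing objects" tip above can be illustrated with a minimal sketch (the sizes here are arbitrary): appending to a vector copies it on every iteration, while pre-allocating writes in place.

```r
# Anti-pattern: growing a vector copies it on every c() call, O(n^2) work.
grow <- function(n) {
  v <- numeric(0)
  for (i in 1:n) v <- c(v, i)
  v
}

# Better: allocate the full length once, then fill in place.
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) v[i] <- i
  v
}

n <- 20000
t_grow <- system.time(a <- grow(n))["elapsed"]
t_pre  <- system.time(b <- prealloc(n))["elapsed"]

# Same result; the pre-allocated version is dramatically faster for large n.
stopifnot(identical(a, b))
```

(Vectorized `as.numeric(1:n)` would of course beat both; the loop form is kept only to isolate the allocation effect.)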
Example: Matrix Margins

se1<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (i in (1:dim(m)[1])) {
    for ( j in (1:dim(m)[2]) ) {
      sumvector[j]<-sumvector[j]+m[i,j]
    }
  }
  return(sumvector)
}

se2<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (j in (1:dim(m)[2])) {
    for ( i in (1:dim(m)[1]) ) {
      sumvector[j]<-sumvector[j]+m[i,j]
    }
  }
  return(sumvector)
}

se3<-function(m) {
  sumvector=numeric(dim(m)[2])
  for (j in (1:dim(m)[2])) {
    sumvector[j]<-sum(m[,j])
  }
  return(sumvector)
}

se4<-function(m) {
  apply(m,2,sum)
}

se1a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (i in (1:dim(m)[2])) {
    for ( j in (1:dim(m)[1]) ) {
      sumvector[j]<-sumvector[j]+m[j,i]
    }
  }
  return(sumvector)
}

se2a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (j in (1:dim(m)[1])) {
    for ( i in (1:dim(m)[2]) ) {
      sumvector[j]<-sumvector[j]+m[j,i]
    }
  }
  return(sumvector)
}

se3a<-function(m) {
  sumvector=numeric(dim(m)[1])
  for (j in (1:dim(m)[1])) {
    sumvector[j]<-sum(m[j,])
  }
  return(sumvector)
}

se4a<-function(m) {
  apply(m,1,sum)
}
> mm<-matrix(runif(5000^2),nrow=5000)
> system.time(res1<-se1(mm))
   user  system elapsed
 437.48    0.36  512.83
> system.time(res2<-se2(mm))
   user  system elapsed
 412.16    0.29  464.53
> system.time(res3<-se3(mm))
   user  system elapsed
   2.09    0.24    8.77
> system.time(res4<-se4(mm))
   user  system elapsed
   2.99    0.38    3.68
> system.time(res1a<-se1a(mm))
   user  system elapsed
 427.87    0.34  478.53
> system.time(res2a<-se2a(mm))
   user  system elapsed
 436.04    0.36  473.20
> system.time(res3a<-se3a(mm))
   user  system elapsed
   2.03    0.35    2.46
> system.time(res4a<-se4a(mm))
   user  system elapsed
   3.54    0.45    4.17
> system.time(res4a<-se4a(mm))
   user  system elapsed
   3.67    0.53    4.40
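A follow-up worth noting (not on the original slide): for this particular task base R also provides colSums()/rowSums(), which run compiled C code over the whole matrix and are usually faster still than apply():

```r
# Smaller matrix than the slide's, just to compare results quickly.
mm <- matrix(runif(500^2), nrow = 500)

# apply() loops over columns calling an R function each time;
# colSums() performs the whole reduction in compiled code.
res_apply <- apply(mm, 2, sum)
res_csum  <- colSums(mm)

# Identical answers up to floating-point rounding.
stopifnot(isTRUE(all.equal(res_apply, res_csum)))
```

The same holds for rowSums() versus apply(m, 1, sum).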
Why Parallelize?
Time vs. money
Most computers are idle most of the time
Be the first on the block to have your own compute cluster
but…
Parallelization can decrease throughput
Often takes considerable effort
Difficult to ensure correct operation
Sources of Parallelism
Intra-core
  Pipelining: execute a different instruction at each pipeline stage
  Hyperthreading: very fast switching between two processes, when one is "stalled" in the pipeline
  SSE: specialized vector operations on 8 data points simultaneously
  FPU: floating point math done in a separate unit
Intra-system
  GPU: specialized simultaneous updates of large matrices
  Multi-core: shared memory, independent programs
Distributed
  Separate computers
  Connected by network
  "Shared nothing"
[Figure: block diagram of a single system – processors (each with cores, L1/L2 caches, and FPUs), memory, GPU, network card, and a RAID controller with disks]
Amdahl's 'Law'
Most code is not 100% parallelizable/tunable
Speedup is limited by the percentage of time the non-tunable part requires
Implication 1: may not be worth the effort of tuning if tunable code takes up less than 80% of current runtime
Implication 2: most programs will not benefit significantly from running on more than 4 processors
Implication 3: if communication & synchronization costs are high, the problem may take longer to solve
[Figure: Amdahl's law – speedup vs. percent parallelizable, plotted for 4 processors and for n = 32]
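Amdahl's law itself is a one-liner: with a fraction p of the runtime parallelizable across n processors, speedup = 1 / ((1 - p) + p/n). A minimal sketch reproducing the implications above:

```r
# Amdahl's law: overall speedup when fraction p runs on n processors.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

# Even with 80% of the code parallelizable, 4 processors give only 2.5x...
amdahl(0.8, 4)    # 2.5

# ...and no number of processors can exceed 1/(1-p) = 5x.
amdahl(0.8, 1e9)
```

This is why the slide suggests not bothering unless the tunable fraction exceeds ~80% of current runtime.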
Resource Contention
Contention = too many requests for the same resources
E.g.: 10 processors writing to the same block of memory, disk file, etc.
Use caching (for reads), data blocking
Most common on SMP/multi-core architectures
Synchronization
When processes run in parallel, their operations can be arbitrarily interleaved
Isolate code that cannot be interleaved in critical regions
Use synchronization to coordinate access to critical regions
A race condition results when the final answer depends on the order in which operations execute
Race conditions are almost impossible to debug by trial-and-error
Race Condition Example

ATM_Withdrawal_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    // get account balance from database
    curBalance = get_balance(accountNum)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    dispense_cash(amount)
  }
}

Meanwhile, at some other location, the identical account is used!
…
// wait for login
accountNum = waitForLogin()
// get account balance from database
curBalance = get_balance(accountNum)
// ask user for amount
amount = get_withdrawal(curBalance)
newBalance = curBalance - amount
set_balance(accountNum, newBalance)
dispense_cash(amount)
Synchronization Building Blocks
Tasks: independent execution sequences
Test-and-set: an atomic test and set makes the check safe
  // SYNCHRONIZATION
  while (testAndSet(flag));
  // Critical region
  // SYNCHRONIZATION
  flag = FALSE
Locks: test-and-set + automatic wait loop
Semaphores:
  Countdown – N processes allowed in critical region
  + Queue – when one process releases the semaphore, all get a turn
No Race Condition – Synchronized

ATM_Withdrawal_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    // SYNCHRONIZATION
    waitForLock(accountNum)
    // BEGIN CRITICAL REGION
    // get account balance from database
    curBalance = get_balance(accountNum)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    // END CRITICAL REGION
    // SYNCHRONIZATION
    releaseLock(accountNum)
    dispense_cash(amount)
  }
}
Deadlock occurs when all processes are waiting for resources held by other processes
Starvation occurs when one or more processes receives no resources over time
[Figure: Process1, Process2, Process3 in a circular waiting cycle]
Deadlock Example

Try simultaneously: A->B, B->C, C->A

ATM_transfer_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    accountNum2 = askForAccount()
    // SYNCHRONIZATION
    waitForLock(accountNum)
    waitForLock(accountNum2)
    // BEGIN CRITICAL REGION
    // get account balances from database
    curBalance = get_balance(accountNum)
    curBalance2 = get_balance(accountNum2)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    newBalance2 = curBalance2 + amount
    set_balance(accountNum, newBalance)
    set_balance(accountNum2, newBalance2)
    // SYNCHRONIZATION
    releaseLock(accountNum)
    releaseLock(accountNum2)
  }
}
Avoiding Deadlock
Timeouts
  Release locks after waiting for some set period
  Simple to implement, pragmatic
  Cons:
    Inefficient
    Programs need to get all locks at once before entering any critical regions, or release of a lock may be unsafe
    Timeout values may need tuning, and should be somewhat randomized (to avoid cycling)
Resource graph analysis
  Can be computationally expensive
  Not generally practical for distributed systems
Algorithmic design
  Most elegant and efficient solution
  Examples:
    Waiting room – allow N-1 accesses to a resource that supports 1
    Resource ordering – assign unambiguous ordering to resources, attempt to acquire locks in order
    Ticket – give out the right to attempt to acquire resources
  Requires careful analysis – best to use an existing design
Verification and Debugging
Difficult to trace parallel code – need special tools
Impossible to "debug" parallel algorithms
  Heisenbugs are the rule!
  Debugging serializes execution
Must verify models instead
Two Easy Cases
Explicit parallelization is hard! Look for:
  An existing parallel implementation
  An "embarrassingly parallel" implementation of the problem
Failing that, look for an existing parallel algorithm for the problem
Easy Parallelization in Stata
Stata/MP: Stata professionally parallelized
  Dozens of models
  Completely automatic
  Well documented: www.stata.com/statamp/report.pdf
Limitations
  Parallelization applies only to built-in features
  Most models are only partially parallelized
  Some overhead for parallelization
  Practically limited to 8 processes in most cases
  Does not help for memory issues
Somewhat Easier Parallelization in R
R offers a wide range of frameworks at widely varying levels of maturity
The most mature and well supported framework is snow, which works on top of sockets, NWS, MPI, or PVM
For single systems with multiple cores, the multicore package is straightforward to use
Some packages that use snow automatically:
  BARD: redistricting analysis
  GAMBoost: b-spline boosting
  Matching: propensity score matching
  bnlearn: Bayesian learning
  rgenoud: genetic optimization
  latentnet: latent cluster models
  … a dozen more packages (so far)
Parallelizing R code through lapply()
The *apply() functions are core operations in R
Each runs a function on every element of a list, or slice of a matrix/array
Returns a processed output list, which may be of reduced dimension
Like a FOR loop over a list, but practically guaranteed to have no dependencies between iterations
A natural construct for parallelization
[Figure: data flow through apply() – prepare input, process each element, manipulate output, process more; e.g. ((1,2,3),(4,5,6),(7,8,9)) -> apply() -> (6,15,24)]
Tips on parallelizing R simply
For embarrassingly parallel and data-replication only: condor_submit
For coarse-grained, more complex parallelization:
  Parallelize around the apply() operation
  Massively parallel, distributed: use snow
  Single system, multicore: use multicore
For fine-grained parallelization:
  Too much overhead in R from interpreted code, and from copying data structures on alteration
  Use parallelized C through R's foreign language interfaces
Seek out packages that are already parallelized by consulting the R HPC task view
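The "parallelize around apply()" tip can be sketched in a few lines. In current R the multicore functionality described in these slides lives in the bundled `parallel` package; `mclapply()` is a drop-in, fork-based replacement for `lapply()` (Unix-like systems only):

```r
library(parallel)  # bundled with R; absorbed the multicore interface

inputs <- 1:8

# Serial version: one independent task per list element.
serial <- lapply(inputs, function(i) i^2)

# Parallel drop-in: forks mc.cores workers (on Windows, mc.cores must be 1).
par <- mclapply(inputs, function(i) i^2, mc.cores = 2)

# Results come back in order, identical to the serial run.
stopifnot(identical(serial, par))
```

Because each element is processed independently, no synchronization is needed, which is exactly why *apply() is the natural unit of parallelization.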
snow (Simple Network of Workstations)
A simplified, pragmatic approach for R
Uses sockets (homegrown), PVM, or MPI
Can be used remotely or locally, but with higher overhead on local systems
Well suited to embarrassingly or coarse-grained parallelism
Parallelizes the sapply(), apply(), tapply(), rapply() operators
Not particularly fault tolerant – if hosts go down, you have problems
For the socket interface:
  Use ssh for tunneling
  Create ssh keys; use an ssh key manager to avoid passwords
Not supported by RCE (yet…)
> M<-matrix(1:100,10)
# convert to list form, by vector
> lM<-lapply(1:(dim(M)[1]),function(x)(M[,x]))
# set number of threads to number of cores
> options(cores=multicore:::detectCores(all.tests=TRUE))
# run across multiple cores
> unlist(mclapply(lM,sum))
# if not in console mode, use:
# unlist(mcapplySafer(lM, sum))
[1]  55 155 255 355 455 555 655 755 855 955
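For the snow side, a minimal local socket-cluster sketch (using the `parallel` package, into which snow's makeCluster()/parApply()/stopCluster() interface was folded; in the standalone snow package the calls are the same):

```r
library(parallel)

# Start two local worker processes over sockets (snow's "SOCK" transport).
cl <- makeCluster(2)

M <- matrix(1:100, nrow = 10)

# parApply distributes the per-column sums across the workers.
colsums <- parApply(cl, M, 2, sum)

# Always shut the workers down when done.
stopCluster(cl)

stopifnot(all(colsums == colSums(M)))
```

The same cluster object works with parLapply(), parSapply(), and friends; replacing makeCluster(2) with a list of remote hostnames distributes the work across machines.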
$ condorInteractiveSubmit -x -c 4 /usr/bin/R
R multicore tips
Use where communication costs are too high for distributed computation
Be cautious when running from a GUI (not thread safe)

4. Submit your job:
$ condor_submit test1.submit
Submitting job(s)..........
2 job(s) submitted to cluster 24.

5. Check with condor_status and condor_q.

Variations:
- RCE provides condor_submit_util and condor_watch utilities to simplify access
- use condor_submit_util -k -I to submit: this uses interactive prompts and keeps the submit file. You can then reuse/modify the submit file.
- to run one input file, use $(PROCESS) in arguments to have it do different things
- add notification parameters, or use condor_watch_util to set up an email notification
- chain condor jobs together with dependencies through Condor DAG
Explicit parallelism
  System call level
  Parallelization frameworks
Distributed computing
  Single system image
  Map-reduce
  Message passing
Problem Decomposition
Types of decomposition:
  Independent tasks
  Independent data chunks
  Independent phases / data-flow (pipelining)
Goals:
  Divide the work/resources needed for the problem
  Minimize dependencies among pieces of the problem
  For distributed work: minimize communication among pieces
Examples by directionality:
  Independent ("embarrassingly parallel"): analysis across separate datasets, parameterization studies, sensitivity analysis, MCMC chains, …
  Uni-directional dependency: preprocessing, image processing, …
  Complex dependency: parallel linear algebra, partitioned sorting, MCMC iterations, …
Example: simple data decomposition
  N tasks: split the data into N chunks, one per worker; then return results to the main task
  Basis for loop-level parallelization
  Useful for sapply() type operations
  Can be inefficient if data too small, or chunks require
Producer/consumer:
  Multiple producers put tasks/data-chunks in a queue/buffer
  Multiple consumers take tasks and process them
  Can be pipelined – consumers of one queue can be producers for another stage of processing
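The simple data decomposition above can be sketched in plain R (a minimal illustration; the chunk count and the sum task are arbitrary):

```r
x <- 1:1000
n_workers <- 4

# Split the data into n roughly equal, independent chunks...
chunks <- split(x, cut(seq_along(x), n_workers, labels = FALSE))

# ...process each chunk independently (lapply here; swap in
# mclapply/parLapply for actual parallel execution)...
partial <- lapply(chunks, sum)

# ...then combine the partial results back in the main task.
total <- Reduce(`+`, partial)

stopifnot(total == sum(x))
```

Because each chunk is processed with no reference to the others, there are no dependencies to synchronize, which is what makes this pattern "embarrassingly parallel".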
Parallel Programming Idioms
Used in the implementation of larger patterns
Flow of control:
  Single Program Multiple Data
  Fork-Join
  Master/Worker
  Map/Reduce
  Loop-Parallelism
Data:
  Shared Data
  Shared Queue
  Distributed Array
Synchronization:
  Mutual exclusion (locks, semaphores)
  Barriers
  Collective communication (broadcast, reduction)
Example: DAGMan
One DAG node
• You model the dependencies; DAGMan sequences your batch jobs
Lots of scientific and HPC libraries: maltman.hmdc.harvard.edu/numal/resources
Lots of statistical and data analysis packages: maltman.hmdc.harvard.edu/socsci.shtml
CRAN HPC Task View for R: cran.r-project.org/web/views/HighPerformanceComputing.html
Large-scale visualization:
  Mondrian – large categorical data: rosuda.org/Mondrian/
  GGobi – interactive data visualization: www.ggobi.org/
  VisIt – parallel visualization: wci.llnl.gov/codes/visit/
  ParaView – parallel visualization: www.paraview.org
Large data manipulation:
  PDL, the Perl Data Language – pdl.perl.org
  NumPy, numerical Python – numpy.scipy.org
  Embed R in PostgreSQL – www.joeconway.com/plr/
Further Reading – Computing
Hard problems
  Garey & Johnson, 1979. Computers and Intractability.
  Papadimitriou, 1993. Computational Complexity.
  Arora & Barak, 2009. Computational Complexity: A Modern Approach.
Algorithms – general
  T. Cormen et al., 2001. Introduction to Algorithms (3rd ed.).
  S. Skiena, 1998. The Algorithm Design Manual.
Performance optimization
  Musumeci & Loukides, 2002. System Performance Tuning. O'Reilly.
  D. Shasha & P. Bonnet, 2003. Database Tuning. Morgan Kaufmann.
  Faroult & Robson, 2006. The Art of SQL. O'Reilly.
  Menascé et al., 2004. Performance by Design.
Parallelization
  M. Ben-Ari. Principles of Concurrent and Distributed Programming. Springer.
  Miller & Boxer, 2005. Algorithms Sequential and Parallel.
  Mattson et al., 2004. Patterns for Parallel Programming.
Further Reading – Statistical Computation
S. Goedecker & A. Hoisie, 2001. Performance Optimization of Numerically Intensive Codes. SIAM Press.
J. Nocedal & S.J. Wright, 1999. Numerical Optimization. Springer.
A. Unwin, M. Theus, H. Hofmann, 2006. Graphics of Large Datasets. Springer.
G. Givens & J. Hoeting, 2005. Computational Statistics. Wiley.
Kontoghiorghes, 2006. Handbook of Parallel Computing and Statistics.
Gentle et al., 2004. Handbook of Computational Statistics.
Gentle, 2009. Computational Statistics.
Higham, 2005. Accuracy and Stability of Numerical Algorithms.
Aggarwal, 2007. Data Streams: Models and Algorithms.
Review: Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat analysis after making any changes