National Partnership for Advanced Computational Infrastructure
Advanced Architectures, CSE 190
Reagan W. Moore, San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE
Dec 21, 2015
Course Organization
• Professors / TA
  • Sid Karin - Director, San Diego Supercomputer Center, <[email protected]>
  • Reagan Moore - Associate Director, SDSC <[email protected]>
  • Holly Dail - UCSD TA <[email protected]>
• Seminars
  • State of the art computer architectures
• Mid-term / SDSC tour
• Final exam
Seminars
• 4/3 : Reagan Moore - Performance evaluation heuristics & modeling
• 4/10 : Sid Karin - Historical perspective
• 4/17 : Richard Kaufmann, Compaq - Teraflops systems
• 4/24 : IBM or Sun
• 5/1 : Mark Seager, LLNL - ASCI 10 Tflops computer
• 5/8 : Midterm / SDSC Tour
• 5/15 : John Feo, Tera - Multi-threaded architectures
• 5/22 : Peter Beckman, LANL - Clusters
• 5/29 : Holiday / no class
• 6/5 : Thomas Sterling, Caltech - Petaflops computers
• 6/12 : Final exam
Supercomputers for Simulation and Data Mining
[Figure: diagram with the labels Application, Distributed Archives, Digital Library, Data Mining, Information Discovery, Collection Building]
Heuristics for Characterizing Supercomputers
• Generators of data - numerically intensive computing
  • Usage models for the rate at which supercomputers move data between memory, disk, and archives
  • Usage models for capacity of the data caches (memory size, local disk, and archival storage)
• Analyzers of data - data intensive computing
  • Performance models for combining data analysis with data movement (between caches, disks, archives)
Heuristics
• Experience based models of computer usage
  • Dependent on computer architecture
  • Presence of data caches, memory-mapped I/O
• Architectures used at SDSC
  • CRAY vector computers
    • X-MP, Y-MP, C-90, T-90
  • Parallel computers
    • MPPs - iPSC/860, Paragon, T3D, T3E
    • Clusters - SP
Supercomputer Data Flow Model
[Figure: data flow between CPU, Memory, Local Disk, Archive Disk, and Archive Tape]
Y-MP Heuristics
• Utilization measured on Cray Y-MP
  • Real memory architecture - entire job context is in memory, no paging of data
  • Exceptional memory bandwidth
    • I/O rate from CPU to memory was 28 Bytes per cycle
    • Maximum execution rate was 2 Flops per cycle
• Scaled memory on C-90 to test heuristics
  • Noted that increasing memory from 1 GB to 2 GB decreased idle time from 10% to 2%
  • Sustained execution rate was 1.8 GFlops
Data Generation Metrics
[Figure: data flow between CPU, Memory, Local Disk, Archive Disk, and Archive Tape, annotated as follows]
• CPU to memory: 7 Bytes/Flop
• Memory size: 1 Byte of storage per Flops
• Memory to local disk: 1 Byte/60 Flop
• Local disk: hold data for 1 day (1/7 of data persists for a day)
• 1/7 of data sent to archive
• Archive disk: hold data for 1 week; all data sent to tape
• Archive tape: hold data forever
Peak Teraflops System
• Compute Engine (TeraFlops System): 0.5-1 TB memory, sustains ? GF
• Compute Engine to Local Disk: ? GB/sec
• Local Disk: ? TB, 1 day cache
• Local Disk to Archive Disk: ? MB/sec
• Archive Disk: ? TB, 1 week cache
• Archive Disk to Archive Tape: ? MB/sec
• Archive Tape: ? PB
Data Sizes on Disk
• How much scratch space is used by each job?
  • Disk space is 20 - 40 times the memory size
  • Data lasts for about one day
• Average execution time for long running jobs
  • 30 minutes to 1 hour
• For jobs using all of memory
  • Between 24 and 48 jobs per day
  • Each job uses (Disk space) / (Number of jobs)
    • Or 40/48 Memory = about 80% of memory
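The per-job scratch estimate above can be sketched as a one-line calculation. The function name is a hypothetical helper; the inputs are the slide's own figures (disk 20 to 40 times memory, 24 to 48 jobs per day).

```python
def scratch_per_job(disk_to_memory_ratio, jobs_per_day):
    """Scratch disk used per job, as a multiple of memory size."""
    return disk_to_memory_ratio / jobs_per_day

# Job lengths of 1 hour and 30 minutes give 24 and 48 jobs per day.
for ratio in (20, 40):
    for jobs in (24, 48):
        share = scratch_per_job(ratio, jobs)
        print(f"disk = {ratio} x memory, {jobs} jobs/day: {share:.2f} x memory per job")
```

The 40/48 case gives roughly 0.83, which the slide rounds to 80% of memory.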
Peak Teraflops Data Flow Model
• Compute Engine (TeraFlops System): 0.5-1 TB memory, sustains 150 GF
• Compute Engine to Local Disk: 1 GB/sec
• Local Disk: 10 TB, 1 day cache
• Local Disk to Archive Disk: 40 MB/sec
• Archive Disk: 5 TB, 1 week cache
• Archive Disk to Archive Tape: 40 MB/sec
• Archive Tape: 0.5-1 PB
HPSS Archival Storage System
[Figure: HPSS configuration diagram. Components shown:]
• High Performance Gateway Node
• High Node and Wide Node disk movers, each with a HiPPI driver
• Silver Node: Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon
• Silver Nodes: Tape / disk mover, DCE / FTP / HIS, Log Client
• SSA RAID arrays of 54 GB, 108 GB, and 160 GB
• 830 GB MaxStrat RAID
• RS6000 Tape Mover, PVR (9490)
• 9490 Robot with four drives, 3490 tape
• 3494 Robots with eight and seven tape drives, Magstar 3590 tape
• HiPPI Switch and Trail-Blazer3 Switch
Equivalent of Ohm’s Law for Computer Science
• How does one relate application requirements to computation rates and I/O bandwidths?
• Use prototype data movement problem to derive physical parameters that characterize applications.
Data Distribution Comparison
[Diagram: Data, linked at bandwidth B to the Data Handling Platform, linked at bandwidth b to the Supercomputer]
• Execution rate of the data handling platform is r, of the supercomputer is R, with r < R
• Bandwidths linking systems are B & b
• Operations per bit for analysis is C
• Operations per bit for data transfer is c
• Reduce size of data from S bytes to s bytes and analyze
• Should the data reduction be done before transmission?
Distributing Services
Compare times for analyzing data with size reduction from S to s

Reduce at the data handling platform, then transmit:
  Data Handling Platform: Read Data (S/B), Reduce Data (CS/r), Transmit Data (cs/r)
  Network: s/b
  Supercomputer: Receive Data (cs/R)

Transmit, then reduce at the supercomputer:
  Data Handling Platform: Read Data (S/B), Transmit Data (cS/r)
  Network: S/b
  Supercomputer: Receive Data (cS/R), Reduce Data (CS/R)
Comparison of Time
Processing at supercomputer (move all S bytes, then reduce):
T(Super) = S/B + cS/r + S/b + cS/R + CS/R
Processing at archive (reduce to s bytes, then move):
T(Archive) = S/B + CS/r + cs/r + s/b + cs/R
Optimization Parameter Selection
The comparison is an algebraic inequality in eight independent variables (S, s, B, b, C, c, r, R).
T(Super) < T(Archive)
S/B + cS/r + S/b + cS/R + CS/R < S/B + CS/r + cs/r + s/b + cs/R
Which variable provides the simplest optimization criterion?
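The two times can be written as a pair of functions, as a minimal sketch. Here T(Super) is taken to mean moving all S bytes and reducing at the supercomputer, the reading used by the optimization slides that follow, and T(Archive) means reducing to s bytes at the archive first. The function names and the sample values are assumptions for illustration only.

```python
def t_super(S, s, B, b, C, c, r, R):
    """Read S, send all S bytes, then reduce at the supercomputer (rate R)."""
    return S / B + c * S / r + S / b + c * S / R + C * S / R

def t_archive(S, s, B, b, C, c, r, R):
    """Read S, reduce to s bytes at the archive (rate r), then send s bytes."""
    return S / B + C * S / r + c * s / r + s / b + c * s / R

# Hypothetical values, chosen only to exercise the formulas.
example = dict(S=1e9, s=1e7, B=1e8, b=1e7, C=100, c=10, r=1e8, R=1e10)
print("T(Super)   =", t_super(**example))
print("T(Archive) =", t_archive(**example))
```

With these sample values the analysis is complex enough (C much larger than c) that moving all of the data is the faster choice.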
Scaling Parameters
Data size reduction ratio: s/S
Execution slow down ratio: r/R
Problem complexity: c/C
Communication/Execution balance: r/(cb)
When r/(cb) = 1, the data processing rate is the same as the data transmission rate.
Optimal designs have r/(cb) = 1
Note (r/c) is the number of bits/sec that can be processed.
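In terms of these four ratios, the condition that moving all of the data is faster reduces to a single dimensionless inequality, (c/C)(1 - s/S)[1 + r/R + r/(cb)] < 1 - r/R. The sketch below is one way to express it; the function names and sample values are assumptions, not part of the original slides.

```python
def scaling_ratios(S, s, b, C, c, r, R):
    """Return (s/S, r/R, c/C, r/(c*b)) for a given problem and system."""
    return s / S, r / R, c / C, r / (c * b)

def move_all_is_faster(sigma, rho, kappa, beta):
    """True when sending all data to the supercomputer takes less time.

    sigma = s/S, rho = r/R, kappa = c/C, beta = r/(c*b).
    """
    return kappa * (1 - sigma) * (1 + rho + beta) < 1 - rho

# Hypothetical values for illustration.
ratios = scaling_ratios(S=1e9, s=1e7, b=1e7, C=100, c=10, r=1e8, R=1e10)
print(ratios, move_all_is_faster(*ratios))
```

The read bandwidth B does not appear because the S/B term is common to both alternatives.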
Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast network:
b > (r/C) (1 - s/S) / [1 - r/R - (c/C) (1 + r/R) (1 - s/S)]
Note the denominator changes sign when
C < c (1 + r/R) (1 - s/S) / (1 - r/R)
Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
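The bandwidth criterion can be sketched as a function that returns the minimum network bandwidth, or infinity when the denominator is not positive. The function name and the sample values are hypothetical.

```python
import math

def min_network_bandwidth(S, s, C, c, r, R):
    """Smallest b above which moving all S bytes is the faster choice.

    Returns math.inf when the denominator is not positive, i.e. when
    no network is fast enough and processing at the archive is better.
    """
    reduction = 1 - s / S
    denom = 1 - r / R - (c / C) * (1 + r / R) * reduction
    if denom <= 0:
        return math.inf
    return (r / C) * reduction / denom

# Hypothetical values: a complex analysis (C=100) and a simple one (C=10).
print(min_network_bandwidth(S=1e9, s=1e7, C=100, c=10, r=1e8, R=1e10))
print(min_network_bandwidth(S=1e9, s=1e7, C=10, c=10, r=1e8, R=1e10))
```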
Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast supercomputer:
R > r [1 + (c/C) (1 - s/S)] / {1 - (c/C) (1 - s/S) [1 + r/(cb)]}
Note the denominator changes sign when
C < c (1 - s/S) [1 + r/(cb)]
Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
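The execution rate criterion follows the same pattern. This is a sketch with a hypothetical function name and sample values; it returns infinity when the denominator is not positive.

```python
import math

def min_supercomputer_rate(S, s, b, C, c, r):
    """Smallest R above which moving all S bytes is the faster choice."""
    k = (c / C) * (1 - s / S)
    denom = 1 - k * (1 + r / (c * b))
    if denom <= 0:
        # Complexity too small: even an infinitely fast supercomputer loses.
        return math.inf
    return r * (1 + k) / denom

# Hypothetical values for illustration.
print(min_supercomputer_rate(S=1e9, s=1e7, b=1e7, C=100, c=10, r=1e8))
print(min_supercomputer_rate(S=1e9, s=1e7, b=1e7, C=15, c=10, r=1e8))
```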
Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:
s > S {1 - (C/c) (1 - r/R) / [1 + r/R + r/(cb)]}
Note the criterion changes sign when
C > c [1 + r/R + r/(cb)] / (1 - r/R)
When the complexity is sufficiently large, it is faster to process on the supercomputer even when data can be reduced to one bit.
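A sketch of the data reduction criterion, assuming r < R. The function name and sample values are hypothetical. A result of zero means the bound is negative, so moving all of the data wins for any reduced size.

```python
def min_reduced_size(S, b, C, c, r, R):
    """Reduced size s above which moving all S bytes is the faster choice."""
    bound = S * (1 - (C / c) * (1 - r / R) / (1 + r / R + r / (c * b)))
    # A negative bound means the criterion holds for every s >= 0.
    return max(bound, 0.0)

# Hypothetical values: moderate complexity, then high complexity.
print(min_reduced_size(S=1e9, b=1e7, C=15, c=10, r=1e8, R=1e10))
print(min_reduced_size(S=1e9, b=1e7, C=100, c=10, r=1e8, R=1e10))
```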
Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently complex analysis:
C > c (1 - s/S) [1 + r/R + r/(cb)] / (1 - r/R)
Note, as the execution ratio r/R approaches 1, the required complexity becomes infinite.
Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero.
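The complexity threshold and its two limits can be checked with a short sketch. The function name and sample values are hypothetical.

```python
import math

def min_complexity(S, s, b, c, r, R):
    """Analysis complexity C above which moving all S bytes is faster."""
    if r >= R:
        # The supercomputer is no faster than the archive platform.
        return math.inf
    return c * (1 - s / S) * (1 + r / R + r / (c * b)) / (1 - r / R)

# Hypothetical values for illustration.
print(min_complexity(S=1e9, s=1e7, b=1e7, c=10, r=1e8, R=1e10))
# No data reduction (s == S): the required complexity drops to zero.
print(min_complexity(S=1e9, s=1e9, b=1e7, c=10, r=1e8, R=1e10))
# Execution ratio r/R close to 1: the required complexity grows without bound.
print(min_complexity(S=1e9, s=1e7, b=1e7, c=10, r=0.999e10, R=1e10))
```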
Characterization of Supercomputer Systems
• Sufficiently high complexity
  • Move data to processing engine
    • Digital Library execution of remote services
    • Traditional supercomputer processing of applications
• Sufficiently low complexity
  • Move process to the data source
    • Metacomputing execution of remote applications
    • Traditional digital library service
Computer Architectures
• Processor in memory
  • Do computations within memory
  • Complexity of supported operations
• Commodity processors
  • L2 caches
  • L3 caches
• Parallel computers
  • Memory bandwidth between nodes
    • MPP - shared memory
    • Cluster - distributed memory
Characterization Metric
• Describe systems in terms of their balance
  Optimal designs have r/(cb) = 1
  Equivalent of Ohm’s law: R = C B
• Characterize applications in terms of their complexity
  Operations per byte of data: C = R / B
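As a small illustration of the balance relation, the sketch below computes r/(cb) and the matching complexity C = R / B. The function names are hypothetical; the sample figures (150 GF sustained, 1 GB/sec to local disk) are the ones quoted on the Peak Teraflops Data Flow Model slide.

```python
def balance(r, c, b):
    """Communication/execution balance r/(c*b); 1 means a balanced design."""
    return r / (c * b)

def balanced_complexity(rate, bandwidth):
    """Operations per byte at which compute keeps pace with data: C = R / B."""
    return rate / bandwidth

ops_per_byte = balanced_complexity(rate=150e9, bandwidth=1e9)
print("balanced complexity:", ops_per_byte, "operations per byte")
print("balance at that complexity:", balance(r=150e9, c=ops_per_byte, b=1e9))
```

An application needing fewer operations per byte than this would leave the processor waiting on data; one needing more would leave the I/O path idle.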
Second Example
• Inclusion of latency (time for process to start) and overhead (time to execute communication protocol)
• Illustrate with combined optimization of use of network and CPU
Optimizing Use of Resources
• Compare time needed to do calculations with time needed to access data over a network
• Time spent using a CPU = Execution time + protocol processing time
  = Cc * Sc / Rc + Cp * St / Rp
Where
  St = size of transmitted data (bytes)
  Sc = size of application data (bytes)
  Cc = number of operations per byte of application data
  Cp = number of operations per byte to process protocol
  Rc = execution rate of application
  Rp = execution rate of protocol
Characterizing Latency
• Time during which a network transmits data = Latency for initiating transfer + transmission time
  = L + St / B
Where
  L is the round trip latency at the speed of light (sec)
  B is the bandwidth (bytes/sec)
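The CPU time from the preceding slide and the network time above can be sketched as two functions. The function names and all sample values are assumptions for illustration.

```python
def cpu_time(Sc, St, Cc, Cp, Rc, Rp):
    """Execution time plus protocol processing time, in seconds."""
    return Cc * Sc / Rc + Cp * St / Rp

def network_time(St, L, B):
    """Latency for initiating the transfer plus transmission time, in seconds."""
    return L + St / B

# Hypothetical workload: 100 MB of application data, 1 MB transmitted.
print("CPU time    :", cpu_time(Sc=1e8, St=1e6, Cc=10, Cp=5, Rc=1e9, Rp=1e9))
print("Network time:", network_time(St=1e6, L=0.05, B=1e7))
```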
Solve for Balanced System
• CPU utilization time = Network utilization time
• Solve for transmission size as a function of Sc/St
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]
Solution exists when
  Sc/St > [Rc / (B * Cc)] [1 - B * Cp / Rp]
  and B * Cp / Rp < 1
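A sketch of the balanced transfer size as a function of the ratio Sc/St (called h in the code). The function name and sample values are hypothetical; it returns None when the denominator is not positive, which is the case where no balanced solution exists.

```python
def balanced_transfer_size(h, L, B, Cc, Cp, Rc, Rp):
    """Transmitted size St (bytes) at which CPU time equals network time.

    h is the ratio Sc / St of application data to transmitted data.
    """
    denom = B * Cp / Rp + (B * Cc / Rc) * h - 1
    if denom <= 0:
        return None
    return L * B / denom

# Hypothetical system: 10 MB/sec link, 50 ms latency, 1 Gop/sec processors.
params = dict(L=0.05, B=1e7, Cc=10, Cp=5, Rc=1e9, Rp=1e9)
print(balanced_transfer_size(h=20, **params))
print(balanced_transfer_size(h=9, **params))
```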
Comparing Utilization of Resources
• Network utilization
  Un = Transmission time / (Transmission + latency)
     = 1 / [1 + (L * B / St)]
• CPU utilization
  Uc = Execution time / (Execution + Protocol processing)
     = 1 / [1 + (Cp * Rc) / (Cc * Rp) * (St / Sc)]
Define h = Sc / St
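The two utilizations can be tabulated against h with a short sketch. The function names and the sample values are assumptions; St is held fixed here purely for illustration.

```python
def network_utilization(St, L, B):
    """Fraction of network time spent transmitting rather than in latency."""
    return 1 / (1 + L * B / St)

def cpu_utilization(h, Cc, Cp, Rc, Rp):
    """Fraction of CPU time spent on the application rather than protocol."""
    return 1 / (1 + (Cp * Rc) / (Cc * Rp) / h)

# Hypothetical values for illustration.
print("U-network:", network_utilization(St=5e5, L=0.05, B=1e7))
for h in (0.1, 1, 10, 100):
    print("h =", h, "U-cpu:", round(cpu_utilization(h, Cc=10, Cp=5, Rc=1e9, Rp=1e9), 4))
```

CPU utilization rises toward 1 as h grows, since more computation is done per transmitted byte.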
Comparing Efficiencies
[Figure: utilization (U-cpu and U-network) plotted against h = S-compute / S-transmit]
Crossover Point
• When utilization of bandwidth and execution resources is balanced:
  1 / [1 + (L * B / St)] = 1 / [1 + (Cp * Rc) / (Cc * Rp) / h]
For optimal St, solve for h = Sc/St, and find
  h = [Rc Cp / (2 Rp Cc)] [sqrt(1 + 4 Rp / (Cp B)) - 1]
For small B * Cp / Rp
  h ~ Rc / (Cc B), or St / B ~ Sc Cc / Rc
And transmission time ~ execution time
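As a cross-check on the conclusion that transmission time matches execution time, the sketch below assumes both conditions hold at once: CPU time equals network time, and the two utilizations are equal. Under that assumption execution time equals transmission time and protocol time equals latency, which gives h = Rc / (Cc B) and St = L Rp / Cp. This is one reading of how the conditions combine, not necessarily the derivation behind the closed form above; the function name and sample values are hypothetical.

```python
def crossover(L, B, Cc, Cp, Rc, Rp):
    """Return (h, St) where times and utilizations are both balanced.

    Assumes CPU time == network time and U-cpu == U-network together.
    """
    h = Rc / (Cc * B)      # execution time equals transmission time
    St = L * Rp / Cp       # protocol processing time equals latency
    return h, St

# Hypothetical system parameters.
L, B, Cc, Cp, Rc, Rp = 0.05, 1e7, 10, 5, 1e9, 1e9
h, St = crossover(L, B, Cc, Cp, Rc, Rp)
Sc = h * St
print("h =", h, "St =", St, "bytes")
print("execution time   :", Cc * Sc / Rc)
print("transmission time:", St / B)
```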
Application Summary
• Optimal application for a given architecture
  B * Cc / Rc ~ 1
  (Bytes/sec) (Operations/byte) / (Operations/sec)
  Cc ~ Rc / B
• Also need cost of network utilization to be small
  B * Cp / Rp < 1
• And amount of data transmitted proportional to latency
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]
Further Information
http://www.npaci.edu/DICE