Parallelizing Legacy Applications in the Message-Passing Programming Model and the Example of MOPAC
Tseng-Hui (Frank) Lin
April 7, 2000
Legacy Applications
• The functions they perform are still useful
• Large user population
• Large investment already made
• Rewriting is expensive
• Rewriting is risky
• Modified over a long period of time
– Changed by different people
– Historical code
– Dead code
– Outdated concepts
• Major bugs have been fixed
What Legacy Applications Need
• Provide higher resolution
• Run larger data sets
• Graphical representation of scientific data
• Remain certified
How to Meet the Requirements
• Improve performance: parallel computing
• Keep certified: change only the critical parts
• Better user interface: add a GUI
System Configuration
Distributed vs Shared Memory
Message Passing Programming
• Non-parallelizable parts
– Data dependence forces sequential execution
– Not worth parallelizing
• Workload distribution
• Input data distribution
• Distributed computation
– Load balancing
• Results collection (see the MPI sketch below)
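The same pattern in a minimal MPI sketch (C). This is illustrative only: compute_term() is a hypothetical per-iteration workload, not a MOPAC routine; the point is the cyclic workload distribution and the result collection with MPI_Reduce.

```c
#include <mpi.h>
#include <stdio.h>

/* hypothetical per-iteration workload (stand-in for the real computation) */
static double compute_term(int i) { return (double)i * i; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int n = 1000;              /* total number of iterations */
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* workload distribution: node `rank` takes iterations rank, rank+nprocs, ... */
    for (int i = rank; i < n; i += nprocs)
        local += compute_term(i);

    /* results collection on node 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```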
Non-Parallelizable Parts
Amdahl's Law: let $T_s$ be the sequential run time, $P$ the fraction of that time that can be parallelized, and $p$ the number of processors. Then

$$T_p = (1 - P)\,T_s + \frac{P}{p}\,T_s, \qquad S_p = \frac{T_s}{T_p} = \frac{1}{(1 - P) + P/p}.$$
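A worked instance (a sketch using the 1X profiling figures from the sequential-analysis table later in the talk; the 16-processor value is derived here, not quoted from the slides): with $P = 0.9385$,

$$S_\infty = \frac{1}{1 - 0.9385} \approx 16.3, \qquad S_{16} = \frac{1}{(1 - 0.9385) + 0.9385/16} \approx 8.3.$$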
MOPAC
• Semi-empirical molecular orbital package
– Methods: MNDO, MINDO/3, AM1, PM3
• MOPAC 3 submitted to QCPE in 1985
• MOPAC 6 ported to many platforms
– VMS
– UNIX (our work is based on this version)
– DOS/Windows
• MOPAC 7 is the current version
MOPAC input file
L1 : UHF PULAY MINDO3 VECTORS DENSITY LOCAL T=300
L2 : EXAMPLE OF DATA FOR MOPAC
L3 : MINDO/3 UHF CLOSED-SHELL D2D ETHYLENE
L41: C
L42: C 1.400118 1
L43: H 1.098326 1 123.572063 1
L44: H 1.098326 1 123.572063 1 180.000000 0 2 1 3
L45: H 1.098326 1 123.572063 1 90.000000 0 1 2 3
L46: H 1.098326 1 123.572063 1 270.000000 0 1 2 3
L5 :
Line 1 : keywords
Line 2 : title
Line 3 : comments
Lines 41-46: molecular structure as a Z-matrix (internal coordinates)
Line 5 : blank line marks the end of data
Hartree-Fock Self Consistent Field
Schrödinger equation:
$$\hat{H}\,\Psi = E\,\Psi$$

Matrix equation form (3.1.3):
$$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\mathbf{E}$$

Matrix representation of the Fock matrix (3.1.4):
$$F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\Big[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\Big], \qquad P_{\lambda\sigma} = 2\sum_{a}^{n/2} c_{\lambda a}\,c^{*}_{\sigma a}$$
HF-SCF Procedure
S1: Calculate the molecular integrals - O(n^4)
S2: Guess an initial eigenvector matrix C
S3: Use C to compute the Fock matrix F - O(n^4)
S4: Transform F to an orthogonal basis - O(n^3)
    and diagonalize it to get a new C - O(n^3)
S5: Stop if C has converged
S6: Form a new guess for C and go to S3
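A minimal C sketch of the S2-S6 loop (illustrative only: build_fock() and diagonalize() are trivial stand-ins, not MOPAC's actual routines, and S1, the molecular-integral evaluation, is omitted):

```c
#include <math.h>
#include <stdio.h>

#define N     4       /* number of basis functions (toy size) */
#define MAXIT 200     /* iteration limit                       */
#define TOL   1e-9    /* convergence threshold on C            */

/* Stand-ins for the real O(n^4) Fock build (S3) and O(n^3) diagonalization (S4). */
static void build_fock(const double C[N][N], double F[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            F[i][j] = C[i][j];
}
static void diagonalize(const double F[N][N], double Cnew[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Cnew[i][j] = F[i][j];
}

int main(void)
{
    double C[N][N] = {{0}}, F[N][N], Cnew[N][N];
    for (int i = 0; i < N; i++)
        C[i][i] = 1.0;                          /* S2: initial guess */

    for (int it = 0; it < MAXIT; it++) {
        build_fock(C, F);                       /* S3 */
        diagonalize(F, Cnew);                   /* S4 */

        double diff = 0.0;                      /* S5: convergence test */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                diff = fmax(diff, fabs(Cnew[i][j] - C[i][j]));
                C[i][j] = Cnew[i][j];           /* S6: guess for the next pass */
            }

        if (diff < TOL) {
            printf("converged after %d iterations\n", it + 1);
            break;
        }
    }
    return 0;
}
```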
MOPAC computation
• Ab initio HF-SCF
– Evaluates all integrals rigorously
– High accuracy
– Requires high computing power
– Limits molecule size
• Semi-empirical HF-SCF
– Uses the same procedure
– Reduces computational complexity
– Supports larger molecules
Semi-empirical SCF
• Ignore some integrals
• Use experimental results to replace integrals
• Assume the AO basis is orthogonal

S1, S3: O(n^4) => O(n^2)
S4: orthogonalization is not needed
New bottleneck: diagonalization, with complexity O(n^3)
Parallelization Procedure
• Sequential analysis
– Time profiling analysis
– Program flow analysis
– Computational complexity analysis
• Parallel analysis
– Data dependence resolution
– Loop parallelization
• Integration
– Communication between modules
Sequential Analysis
• Time profiling analysis
– Identifies the computationally intensive parts
– Usually run on smaller input data
• Program flow analysis
– Verifies that the chosen parts are commonly used
– No domain expert required
• Computational complexity analysis
– The workload distribution changes significantly for different data sizes
MOPAC Sequential Analysis

Percentage of execution time spent in each routine:

Routine         Complexity     1X        10X       100X
DCART           O(n^2)         7.42      0.85      0.09
DENSIT          O(n^3)        15.89     18.10     18.36
DIAG            O(n^3)        64.53     73.51     74.55
HQRII           O(n^3)         6.01      6.85      6.94
Sum                           93.85     99.30     99.93
Max speed-up                  16.26    142.74   1407.57
Sum of O(n^3)                 86.43     98.45     99.84
Max speed-up                   7.37     64.69    637.92

Assume the complexity of the remaining parts is O(n^2).
Loop Parallelization
• Scalar forward substitution: remove temporary variables
• Induction variable substitution: resolve dependences
• Loop interchange/merge: enlarge granularity, reduce synchronization
• Scalar expansion: resolve data dependences on scalars
• Variable copying: resolve data dependences on arrays (see the before/after sketch below)
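A hedged before/after illustration of two of these transformations in generic C (not MOPAC source): the induction variable k and the reused scalar t serialize the "before" loop; after substitution and expansion, every iteration is independent.

```c
/* Before: k is an induction variable and t a reused scalar; both create
 * loop-carried dependences that block parallel execution.
 * (c must hold at least 2*n + 1 elements in both versions.) */
void before(int n, const double *a, const double *b, double *c)
{
    int k = 0;
    double t;
    for (int i = 0; i < n; i++) {
        k = k + 2;            /* induction variable              */
        t = a[i] + b[i];      /* scalar reused across iterations */
        c[k] = t * t;
    }
}

/* After: induction variable substituted, scalar expanded (privatized);
 * iterations no longer depend on each other and can run on any node. */
void after(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        int    k = 2 * (i + 1);   /* induction variable substitution */
        double t = a[i] + b[i];   /* scalar expansion                */
        c[k] = t * t;
    }
}
```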
MOPAC Parallelization: DENSIT
• Function: computes the density matrix
• Two one-level loops inside a two-level loop
• Triangular computational space
• The outer two-level loop is merged into a single loop over [1..n(n+1)/2] (see the sketch below)
• Lower computation/communication ratio when n is small
• Benefits from low-latency communication when n is small
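A sketch of that loop merge, with an assumed stand-in density_term() in place of the real density-matrix computation: the triangular (i, j) space, j <= i, is flattened into one loop that is easy to split across nodes (shown here with a cyclic distribution; indices are 0-based, unlike the Fortran range above).

```c
#include <math.h>

/* stand-in for the real per-element computation */
static double density_term(int i, int j) { return (double)(i + j); }

void densit_merged(int n, int rank, int nprocs, double *p /* n*(n+1)/2 entries */)
{
    long total = (long)n * (n + 1) / 2;

    /* each node handles iterations k = rank, rank + nprocs, ... */
    for (long k = rank; k < total; k += nprocs) {
        /* recover (i, j) from the flattened lower-triangle index k */
        int  i    = (int)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
        long base = (long)i * (i + 1) / 2;
        if (k < base) { i--; base = (long)i * (i + 1) / 2; }   /* rounding guard */
        int  j    = (int)(k - base);

        p[k] = density_term(i, j);
    }
}
```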
MOPAC Parallelization: DIAG
• Part 1: generate the Fock matrix over molecular orbitals (FMO)
– Higher computation/communication ratio
– Find the global maximum TINY from the local maxima (see the MPI fragment below)
– The FMO matrix must be redistributed for Part 2
• Part 2: 2x2 rotations to eliminate significant off-diagonal elements
– The "if" structure causes load imbalance
– The innermost loop must be interchanged outward
– Some calculations run on all nodes to save communication
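For the "global maximum TINY" step, a hedged MPI fragment (variable and function names are illustrative, not MOPAC's): each node scans its own share of the matrix, then MPI_Allreduce with MPI_MAX leaves the same global maximum on every node.

```c
#include <math.h>
#include <mpi.h>

/* fmo_local: this node's portion of the FMO matrix, nlocal elements */
double global_tiny(const double *fmo_local, long nlocal)
{
    double local_max = 0.0, global_max = 0.0;

    for (long k = 0; k < nlocal; k++)
        if (fabs(fmo_local[k]) > local_max)
            local_max = fabs(fmo_local[k]);

    /* every node receives the maximum over all local maxima */
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_max;
}
```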
MOPAC Parallelization: HQRII
• Function: standard eigensolver
• R. J. Allen's survey of parallel eigensolvers
• Uses the pdspevx() routine from PNNL's PeIGS library
• Uses the MPI communication library
• Exchanges data in small chunks; performs well when n/p > 8
• Implemented in C, which packs the matrix differently (row major); see the sketch below
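A sketch of the packing mismatch mentioned above, assuming a generic conversion (this is not PeIGS's actual interface): the Fortran side stores the full matrix column-major, while the C side wants a row-major packed lower triangle.

```c
/* Copy the lower triangle of an n x n column-major (Fortran-order) matrix
 * into a row-major packed array: packed[i*(i+1)/2 + j] = A(i, j), j <= i. */
void colmajor_to_packed_rowmajor(int n, const double *a_colmajor, double *packed)
{
    long k = 0;
    for (int i = 0; i < n; i++)           /* row of the lower triangle */
        for (int j = 0; j <= i; j++)      /* column, j <= i            */
            packed[k++] = a_colmajor[(long)j * n + i];   /* A(i, j) in column-major */
}
```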
Integration
Comm between Modules
• Parallel part - sequential part
– Uses TCP/IP (see the socket sketch below)
– Automatically upgrades to shared memory when possible
• Sequential part - user interface
– Input and output files
– Application/Advanced Visualization System (AVS) remote-module communication
• User interface - display
– AVS
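A minimal sketch of the TCP/IP link between the sequential front end and the parallel solver, with an assumed host/port and no particular message format (the thesis's actual protocol and the shared-memory upgrade path are not shown):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to the parallel solver; returns a socket descriptor or -1.
 * The caller would then write() input data and read() back results. */
int connect_to_solver(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```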
MOPAC Control Panel & Module
MOPAC GUI
Data Files and Platform
• Platforms:
– SGI Power Challenge
– IBM SP2

Data file     Light atoms   Heavy atoms   Data size (n_light + 4 n_heavy)
1crn                0            327           1308
Vcop_4            279            169            955
C60_3               0            180            720
C60_2               0            120            480
porphyrn           33             58            265
DENSIT Speed-up
[Speed-up plots on Power Challenge and SP2]
DIAG Speed-up
[Speed-up plots on Power Challenge and SP2]
HQRII Speed-up
[Speed-up plots on Power Challenge and SP2]
Overall Speed-up
[Projected speed-up plots for Power Challenge and SP2, assuming the sequential part is O(n^2)]
Overall Speed-up

Data file     16 Power Challenge    16 SP2            32 SP2
1crn          11.856 / 10.015       13.838 / 11.367   22.531 / 16.511
Vcop_4        11.170 /  9.068       12.809 / 10.092   21.380 / 14.598
C60_3          9.845 /  7.810       12.258 /  9.204   18.551 / 12.227
C60_2          8.174 /  6.309       10.156 /  7.373   13.544 /  8.928
porphyrn       5.746 /  4.549        6.042 /  4.722    6.622 /  5.049

Each cell lists the speed-up assuming the non-parallelizable part is O(1) / O(n^2).
Related work: IBM
• Application: conformational search
• Focus: throughput
Related work: SDSC
• Focus: performance
• Parallelized:
– Evaluation of electron repulsion integrals
– Calculation of first and second derivatives
– Solution of the eigensystem
• Platform: 64-node iPSC/860
• Results:
– Geometry optimization: speed-up = 5.2
– Vibration analysis: speed-up = 40.8
Achievements
• Parallelized legacy applications from a computer-science perspective
• Kept the code validated (certified)
• Developed performance analysis procedures
• Predicted performance on large data sets
• Optimized the parallel code
• Improved performance
• Improved the user interface
Future Work
• Shared memory model
• Web-based user interface
• Dynamic node allocation
• Parallelization of subroutines with lower computational complexity