Parallelizing Legacy Applications in the Message-Passing Programming Model and the Example of MOPAC
Tseng-Hui (Frank) Lin
April 7, 2000
Legacy Applications
• The functions they perform are still useful
• Large user population
• Large investment already made
• Rewriting is expensive
• Rewriting is risky
• Modified over a long period of time
– Changed by different people
– Historical code
– Dead code
– Outdated concepts
• Major bugs have been fixed
What Legacy Applications Need
• Provide higher resolution
• Run larger data sets
• Graphical representation of scientific data
• Remain certified
How to Meet the Requirements
• Improve performance: parallel computing
• Keep certified: change only the critical parts
• Better user interface: add a GUI
System Configuration
Distributed vs Shared Memory
Message Passing Programming
• Non-parallelizable parts
– Data dependence forces sequential execution
– Not worth parallelizing
• Workload distribution
• Input data distribution
• Distributed computation
– Load balancing
• Results collection (see the MPI sketch below)
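The same pattern in a minimal MPI sketch (C). This is illustrative only: compute_term() is a hypothetical per-iteration workload, not a MOPAC routine; the point is the cyclic workload distribution and the result collection with MPI_Reduce.

```c
#include <mpi.h>
#include <stdio.h>

/* hypothetical per-iteration workload (stand-in for the real computation) */
static double compute_term(int i) { return (double)i * i; }

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int n = 1000;              /* total number of iterations */
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* workload distribution: node `rank` takes iterations rank, rank+nprocs, ... */
    for (int i = rank; i < n; i += nprocs)
        local += compute_term(i);

    /* results collection on node 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```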
Non-Parallelizable Parts
Amdahl's Law: let $T_s$ be the sequential run time, $P$ the fraction of that time that can be parallelized, and $p$ the number of processors. Then

$$T_p = (1 - P)\,T_s + \frac{P}{p}\,T_s, \qquad S_p = \frac{T_s}{T_p} = \frac{1}{(1 - P) + P/p}.$$
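A worked instance (a sketch using the 1X profiling figures from the sequential-analysis table later in the talk; the 16-processor value is derived here, not quoted from the slides): with $P = 0.9385$,

$$S_\infty = \frac{1}{1 - 0.9385} \approx 16.3, \qquad S_{16} = \frac{1}{(1 - 0.9385) + 0.9385/16} \approx 8.3.$$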
MOPAC
• Semi-empirical molecular orbital package
– Methods: MNDO, MINDO/3, AM1, PM3
• MOPAC 3 submitted to QCPE in 1985
• MOPAC 6 ported to many platforms
– VMS
– UNIX (our work is based on this version)
– DOS/Windows
• MOPAC 7 is the current version
MOPAC input file
L1 : UHF PULAY MINDO3 VECTORS DENSITY LOCAL T=300
L2 : EXAMPLE OF DATA FOR MOPAC
L3 : MINDO/3 UHF CLOSED-SHELL D2D ETHYLENE
L41: C
L42: C 1.400118 1
L43: H 1.098326 1 123.572063 1
L44: H 1.098326 1 123.572063 1 180.000000 0 2 1 3
L45: H 1.098326 1 123.572063 1 90.000000 0 1 2 3
L46: H 1.098326 1 123.572063 1 270.000000 0 1 2 3
L5 :
Line 1 : keywords
Line 2 : title
Line 3 : comments
Lines 41-46: molecular structure as a Z-matrix (internal coordinates)
Line 5 : blank line marks the end of data
Hartree-Fock Self Consistent Field
Schrödinger equation:
$$\hat{H}\,\Psi = E\,\Psi$$

Matrix equation form (3.1.3):
$$\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\mathbf{E}$$

Matrix representation of the Fock matrix (3.1.4):
$$F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\Big[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\Big], \qquad P_{\lambda\sigma} = 2\sum_{a}^{n/2} c_{\lambda a}\,c^{*}_{\sigma a}$$
HF-SCF Procedure
S1: Calculate the molecular integrals - O(n^4)
S2: Guess an initial eigenvector matrix C
S3: Use C to compute the Fock matrix F - O(n^4)
S4: Transform F to an orthogonal basis - O(n^3)
    and diagonalize it to get a new C - O(n^3)
S5: Stop if C has converged
S6: Form a new guess for C and go to S3
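A minimal C sketch of the S2-S6 loop (illustrative only: build_fock() and diagonalize() are trivial stand-ins, not MOPAC's actual routines, and S1, the molecular-integral evaluation, is omitted):

```c
#include <math.h>
#include <stdio.h>

#define N     4       /* number of basis functions (toy size) */
#define MAXIT 200     /* iteration limit                       */
#define TOL   1e-9    /* convergence threshold on C            */

/* Stand-ins for the real O(n^4) Fock build (S3) and O(n^3) diagonalization (S4). */
static void build_fock(const double C[N][N], double F[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            F[i][j] = C[i][j];
}
static void diagonalize(const double F[N][N], double Cnew[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            Cnew[i][j] = F[i][j];
}

int main(void)
{
    double C[N][N] = {{0}}, F[N][N], Cnew[N][N];
    for (int i = 0; i < N; i++)
        C[i][i] = 1.0;                          /* S2: initial guess */

    for (int it = 0; it < MAXIT; it++) {
        build_fock(C, F);                       /* S3 */
        diagonalize(F, Cnew);                   /* S4 */

        double diff = 0.0;                      /* S5: convergence test */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                diff = fmax(diff, fabs(Cnew[i][j] - C[i][j]));
                C[i][j] = Cnew[i][j];           /* S6: guess for the next pass */
            }

        if (diff < TOL) {
            printf("converged after %d iterations\n", it + 1);
            break;
        }
    }
    return 0;
}
```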
MOPAC computation
• Ab initio HF-SCF
– Evaluates all integrals rigorously
– High accuracy
– Requires high computing power
– Limits molecule size
• Semi-empirical HF-SCF
– Uses the same procedure
– Reduces computational complexity
– Supports larger molecules
Semi-empirical SCF
• Ignore some integrals
• Use experimental results to replace integrals
• Assume the AO basis is orthogonal

S1, S3: O(n^4) => O(n^2)
S4: orthogonalization is not needed
New bottleneck: diagonalization, with complexity O(n^3)
Parallelization Procedure
• Sequential analysis
– Time profiling analysis
– Program flow analysis
– Computational complexity analysis
• Parallel analysis
– Data dependence resolution
– Loop parallelization
• Integration
– Communication between modules
Sequential Analysis
• Time profiling analysis
– Identifies the computationally intensive parts
– Usually run on smaller input data
• Program flow analysis
– Verifies that the chosen parts are commonly used
– No domain expert required
• Computational complexity analysis
– The workload distribution changes significantly for different data sizes
MOPAC Sequential Analysis

Percentage of execution time spent in each routine:

Routine         Complexity     1X        10X       100X
DCART           O(n^2)         7.42      0.85      0.09
DENSIT          O(n^3)        15.89     18.10     18.36
DIAG            O(n^3)        64.53     73.51     74.55
HQRII           O(n^3)         6.01      6.85      6.94
Sum                           93.85     99.30     99.93
Max speed-up                  16.26    142.74   1407.57
Sum of O(n^3)                 86.43     98.45     99.84
Max speed-up                   7.37     64.69    637.92

Assume the complexity of the remaining parts is O(n^2).
Loop Parallelization
• Scalar forward substitution: remove temporary variables
• Induction variable substitution: resolve dependences
• Loop interchange/merge: enlarge granularity, reduce synchronization
• Scalar expansion: resolve data dependences on scalars
• Variable copying: resolve data dependences on arrays (see the before/after sketch below)
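A hedged before/after illustration of two of these transformations in generic C (not MOPAC source): the induction variable k and the reused scalar t serialize the "before" loop; after substitution and expansion, every iteration is independent.

```c
/* Before: k is an induction variable and t a reused scalar; both create
 * loop-carried dependences that block parallel execution.
 * (c must hold at least 2*n + 1 elements in both versions.) */
void before(int n, const double *a, const double *b, double *c)
{
    int k = 0;
    double t;
    for (int i = 0; i < n; i++) {
        k = k + 2;            /* induction variable              */
        t = a[i] + b[i];      /* scalar reused across iterations */
        c[k] = t * t;
    }
}

/* After: induction variable substituted, scalar expanded (privatized);
 * iterations no longer depend on each other and can run on any node. */
void after(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        int    k = 2 * (i + 1);   /* induction variable substitution */
        double t = a[i] + b[i];   /* scalar expansion                */
        c[k] = t * t;
    }
}
```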
MOPAC Parallelization: DENSIT
• Function: computes the density matrix
• Two one-level loops inside a two-level loop
• Triangular computational space
• The outer two-level loop is merged into a single loop over [1..n(n+1)/2] (see the sketch below)
• Lower computation/communication ratio when n is small
• Benefits from low-latency communication when n is small
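A sketch of that loop merge, with an assumed stand-in density_term() in place of the real density-matrix computation: the triangular (i, j) space, j <= i, is flattened into one loop that is easy to split across nodes (shown here with a cyclic distribution; indices are 0-based, unlike the Fortran range above).

```c
#include <math.h>

/* stand-in for the real per-element computation */
static double density_term(int i, int j) { return (double)(i + j); }

void densit_merged(int n, int rank, int nprocs, double *p /* n*(n+1)/2 entries */)
{
    long total = (long)n * (n + 1) / 2;

    /* each node handles iterations k = rank, rank + nprocs, ... */
    for (long k = rank; k < total; k += nprocs) {
        /* recover (i, j) from the flattened lower-triangle index k */
        int  i    = (int)((sqrt(8.0 * (double)k + 1.0) - 1.0) / 2.0);
        long base = (long)i * (i + 1) / 2;
        if (k < base) { i--; base = (long)i * (i + 1) / 2; }   /* rounding guard */
        int  j    = (int)(k - base);

        p[k] = density_term(i, j);
    }
}
```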
MOPAC Parallelization: DIAG
• Part 1: generate the Fock matrix over molecular orbitals (FMO)
– Higher computation/communication ratio
– Find the global maximum TINY from the local maxima (see the MPI fragment below)
– The FMO matrix must be redistributed for Part 2
• Part 2: 2x2 rotations to eliminate significant off-diagonal elements
– The "if" structure causes load imbalance
– The innermost loop must be interchanged outward
– Some calculations run on all nodes to save communication
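For the "global maximum TINY" step, a hedged MPI fragment (variable and function names are illustrative, not MOPAC's): each node scans its own share of the matrix, then MPI_Allreduce with MPI_MAX leaves the same global maximum on every node.

```c
#include <math.h>
#include <mpi.h>

/* fmo_local: this node's portion of the FMO matrix, nlocal elements */
double global_tiny(const double *fmo_local, long nlocal)
{
    double local_max = 0.0, global_max = 0.0;

    for (long k = 0; k < nlocal; k++)
        if (fabs(fmo_local[k]) > local_max)
            local_max = fabs(fmo_local[k]);

    /* every node receives the maximum over all local maxima */
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_max;
}
```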
MOPAC Parallelization: HQRII
• Function: standard eigensolver
• R. J. Allen's survey of parallel eigensolvers
• Uses the pdspevx() routine from PNNL's PeIGS library
• Uses the MPI communication library
• Exchanges data in small chunks; performs well when n/p > 8
• Implemented in C, which packs the matrix differently (row major); see the sketch below
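A sketch of the packing mismatch mentioned above, assuming a generic conversion (this is not PeIGS's actual interface): the Fortran side stores the full matrix column-major, while the C side wants a row-major packed lower triangle.

```c
/* Copy the lower triangle of an n x n column-major (Fortran-order) matrix
 * into a row-major packed array: packed[i*(i+1)/2 + j] = A(i, j), j <= i. */
void colmajor_to_packed_rowmajor(int n, const double *a_colmajor, double *packed)
{
    long k = 0;
    for (int i = 0; i < n; i++)           /* row of the lower triangle */
        for (int j = 0; j <= i; j++)      /* column, j <= i            */
            packed[k++] = a_colmajor[(long)j * n + i];   /* A(i, j) in column-major */
}
```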
Integration
Comm between Modules
• Parallel part - sequential part
– Uses TCP/IP (see the socket sketch below)
– Automatically upgrades to shared memory when possible
• Sequential part - user interface
– Input and output files
– Application/Advanced Visualization System (AVS) remote-module communication
• User interface - display
– AVS
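A minimal sketch of the TCP/IP link between the sequential front end and the parallel solver, with an assumed host/port and no particular message format (the thesis's actual protocol and the shared-memory upgrade path are not shown):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to the parallel solver; returns a socket descriptor or -1.
 * The caller would then write() input data and read() back results. */
int connect_to_solver(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```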
MOPAC Control Panel & Module
MOPAC GUI
Data Files and Platform
• Platforms:
– SGI Power Challenge
– IBM SP2

Data file     Light atoms   Heavy atoms   Data size (n_light + 4 n_heavy)
1crn                0            327           1308
Vcop_4            279            169            955
C60_3               0            180            720
C60_2               0            120            480
porphyrn           33             58            265
DENSIT Speed-up
[Speed-up plots on Power Challenge and SP2]
DIAG Speed-up
[Speed-up plots on Power Challenge and SP2]
HQRII Speed-up
[Speed-up plots on Power Challenge and SP2]
Overall Speed-up
[Projected speed-up plots for Power Challenge and SP2, assuming the sequential part is O(n^2)]
Overall Speed-up

Data file     16 Power Challenge    16 SP2            32 SP2
1crn          11.856 / 10.015       13.838 / 11.367   22.531 / 16.511
Vcop_4        11.170 /  9.068       12.809 / 10.092   21.380 / 14.598
C60_3          9.845 /  7.810       12.258 /  9.204   18.551 / 12.227
C60_2          8.174 /  6.309       10.156 /  7.373   13.544 /  8.928
porphyrn       5.746 /  4.549        6.042 /  4.722    6.622 /  5.049

Each cell lists the speed-up assuming the non-parallelizable part is O(1) / O(n^2).
Related work: IBM
• Application: conformational search
• Focus: throughput
Related work: SDSC
• Focus: performance
• Parallelized:
– Evaluation of electron repulsion integrals
– Calculation of first and second derivatives
– Solution of the eigensystem
• Platform: 64-node iPSC/860
• Results:
– Geometry optimization: speed-up = 5.2
– Vibration analysis: speed-up = 40.8
Achievements
• Parallelized legacy applications from a computer-science perspective
• Kept the code validated (certified)
• Developed performance analysis procedures
• Predicted performance on large data sets
• Optimized the parallel code
• Improved performance
• Improved the user interface
Future Work
• Shared memory model
• Web-based user interface
• Dynamic node allocation
• Parallelization of subroutines with lower computational complexity