GPU Enhancements for Noise, Vibration and Harshness (NVH) … · 2013. 3. 21. · This session will describe recent algorithmic and implementation advancements used for real world

MSC Software Confidential

GPU Enhancements for Noise, Vibration and Harshness (NVH) Analysis

Dr. Ted Wertheimer


20 Million DOF - 3.9 M elements

3/20/2013


20 Million DOF

• This model extracted many modes:
– up to 1500 Hz structure -> ~26500 modes
– up to 1500 Hz fluid -> ~3200 modes
• Large frequency range: 0 to 1024 Hz in 2048 frequency steps


94 Million DOF

# Nodes   DMP * SMP   Elapsed Time
4         16 * 4      4:58:09



ACMS

• Automated Component Modal Synthesis (ACMS)
• MSC Nastran model is automatically divided into N domains
• Executes in parallel using Distributed Memory Parallel (DMP)
– Shared Memory Parallel (SMP) provides additional speedup


ACMS Domain Decomposition

[Figure: domain decomposition tree, example with DMP=4; domains numbered 0 to 30 are distributed across Master, Slave 1, Slave 2 and Slave 3]


MSC Nastran ACMS – Automotive Models

• Multi-CPU, multi-core parallel scalability
• 2X performance increase from 2010

[Figure: bar chart for Cases 1 to 4, each run serial and on 12 CPUs, comparing MSC Nastran 2010, 2011.1, 2011.2 and 2012; vertical axis 0 to 800]


Nonsymmetric Solver Performance

• Up to 3X faster for exterior acoustics
– Exterior acoustics
– Brake squeal
– Friction
– Rotordynamics

[Figure: bar chart for Case 3 (exterior acoustics), showing "fr resp" and "total job" for MSC Nastran 2011.1, 2011.2 and 2012; vertical axis 0 to 2000]


Improved Performance for Acoustics

• Efficient Participation Factor: 3 times faster (MSC Nastran 2012 vs. MSC Nastran 2010)


MSC Nastran 2013

• Nastran direct equation solver is GPU accelerated
– Sparse direct factorization (MSCLDL, MSCLU)
• Real, Complex, Symmetric, Un-symmetric
– Handles very large fronts with minimal use of pinned host memory
• Lowest granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes
– Impacts several solution sequences:
• High impact (SOL101, SOL108), Mid (SOL103), Low (SOL111, SOL400)



MSC Nastran 2013

• Multi-GPU support on both Linux and Windows
– With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain
– NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000
– CUDA 5.0



Direct sparse solver workflow in MSC Nastran (MSCLDL, MSCLU)

In a proper order, do the following at each node:
• Assembly
• Pivoting
• Block factorization: from Global Stiffness & contribution blocks

Most time-consuming matrix update operations on GPU:
– Diagonal decomposition
– Off-diagonal update
– Schur Complement (trailing matrix update)

[Figure: elimination tree with nodes numbered 1 to 11]
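For multifrontal solvers, the usual "proper order" is a postorder traversal of the elimination tree: each node is processed only after all of its children, so that their contribution blocks are available for assembly. The sketch below illustrates that ordering only. The tree shape and the `postorder` helper are hypothetical and are not taken from MSC Nastran.

```python
def postorder(children, root):
    """Return the tree nodes so that every child precedes its parent."""
    order = []
    stack = [(root, False)]          # (node, children already scheduled?)
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)       # all children are done: process this front
        else:
            stack.append((node, True))
            # Push children in reverse so they are visited left to right.
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return order


# Hypothetical elimination tree: node 7 is the root, leaves are 1, 2, 4, 5.
children = {7: [3, 6], 3: [1, 2], 6: [4, 5]}
order = postorder(children, 7)
print(order)
```

At each node in this order, a solver would assemble the front, pivot, and factorize it; independent subtrees (here the ones rooted at 3 and 6) can be handled concurrently.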


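Block LU Decomposition

Direct solves are (typically) performed using Block LU decomposition
Spend most of their time computing the Schur Complement
Compute bound / low hanging fruit

$$
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & A_{22} - L_{21} U_{12} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} \\ 0 & I \end{bmatrix}
$$

$L_{11} U_{11} = A_{11}$ (DPOTRF)
$L_{11} U_{12} = A_{12}$ (DTRSM)
$L_{21} U_{11} = A_{21}$ (DTRSM)
Schur complement $A_{22} - L_{21} U_{12}$ (DGEMM)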
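The following NumPy sketch works through one block step on a small dense matrix and checks that the three factors multiply back to the original. It is an illustration only: the helper names are hypothetical, the diagonal block is factored with a simple unpivoted LU rather than a LAPACK kernel, and the test matrix is made diagonally dominant so that skipping pivoting is safe.

```python
import numpy as np


def lu_nopivot(a):
    """Unpivoted LU of a small dense block: unit-lower L and upper U."""
    n = a.shape[0]
    l = np.eye(n)
    u = a.astype(float)
    for k in range(n - 1):
        l[k + 1:, k] = u[k + 1:, k] / u[k, k]
        u[k + 1:, k:] -= np.outer(l[k + 1:, k], u[k, k:])
    return l, np.triu(u)


def block_lu_step(a, k):
    """Factor the leading k x k block and form the Schur complement."""
    a11, a12 = a[:k, :k], a[:k, k:]
    a21, a22 = a[k:, :k], a[k:, k:]
    l11, u11 = lu_nopivot(a11)                 # L11 U11 = A11
    u12 = np.linalg.solve(l11, a12)            # L11 U12 = A12
    l21 = np.linalg.solve(u11.T, a21.T).T      # L21 U11 = A21
    schur = a22 - l21 @ u12                    # trailing matrix update
    return l11, u11, u12, l21, schur


rng = np.random.default_rng(0)
n, k = 6, 2
a = rng.random((n, n)) + n * np.eye(n)         # diagonally dominant test matrix
l11, u11, u12, l21, schur = block_lu_step(a, k)

eye, zero = np.eye(n - k), np.zeros((k, n - k))
left = np.block([[l11, zero], [l21, eye]])
middle = np.block([[np.eye(k), zero], [zero.T, schur]])
right = np.block([[u11, u12], [zero.T, eye]])
rebuilt = left @ middle @ right
print(np.allclose(rebuilt, a))
```

The `schur = a22 - l21 @ u12` line is the matrix-matrix product that dominates the run time for large fronts, which is why it is the natural candidate for GPU offload.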


PCIe limit on Schur complement calculation (DGEMM)

• PCIe limits GPU performance
• Host is faster for small fronts
• Requires nRank > 700 for full performance on K20
• M2090 and K20 perform the same until nRank > 300
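A rough back-of-the-envelope model shows why a minimum rank is needed. Assume nRank is the rank k of the update applied to an m x m trailing block: the DGEMM costs about 2·m²·k flops, while moving the block across PCIe costs about 8·m² bytes per direction in double precision. The update is compute bound once the DGEMM time exceeds the transfer time. The function and the throughput figures below are illustrative assumptions, not measurements from these slides, and the model ignores the smaller panel transfers.

```python
def min_rank_for_compute_bound(gemm_flops_per_s, pcie_bytes_per_s,
                               bytes_per_word=8, duplex=True):
    """Smallest update rank k for which DGEMM time >= PCIe transfer time.

    compute time  = 2 * m^2 * k / gemm_flops_per_s
    transfer time = passes * bytes_per_word * m^2 / pcie_bytes_per_s
    where passes is 1 if upload and download overlap (full duplex), else 2.
    The m^2 factor cancels, so the threshold does not depend on m.
    """
    passes = 1 if duplex else 2
    return passes * bytes_per_word * gemm_flops_per_s / (2.0 * pcie_bytes_per_s)


# Assumed, illustrative figures: 1 TFLOP/s sustained DGEMM, 6 GB/s PCIe.
k_overlap = min_rank_for_compute_bound(1.0e12, 6.0e9)
k_serial = min_rank_for_compute_bound(1.0e12, 6.0e9, duplex=False)
print(round(k_overlap), round(k_serial))
```

In this model the threshold scales linearly with the GPU's DGEMM rate, so a faster GPU needs a larger rank before it can outrun the bus. That is consistent in spirit with the observation that two GPU generations perform alike at small ranks.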


MSC Nastran 2013
SMP + GPU acceleration of SOL101 and SOL103

Server node: Sandy Bridge E5-2670 (2.6GHz), Tesla K20X GPU, 128 GB memory

Speedup over serial (higher is better):

Configuration   SOL101, 2.4M rows, 42K front   SOL103, 2.6M rows, 18K front
serial          1X                             1X
4c              2.7X                           1.9X
4c+1g           6X                             2.8X

Lanczos solver (SOL 103):
– Sparse matrix factorization
– Iterate on a block of vectors (solve)
– Orthogonalization of vectors


NVH with MSC Nastran 2013
Coupled Structural-Acoustics simulation with SOL108

Europe Auto OEM: 710K nodes, 3.83M elements
100 frequency increments (FREQ1)
Direct Sparse solver
Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory

[Figure: bar chart of elapsed time in minutes (lower is better), vertical axis 0 to 1000; bars are labeled with speedup over serial]

Configuration     Speedup
serial            1X
1c + 1g           4.8X
4c (smp)          2.7X
4c + 1g           5.2X
8c (dmp=2)        5.5X
8c + 2g (dmp=2)   11.1X
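Direct frequency response solves each excitation frequency independently, so the frequency loop is a natural unit of work to spread across DMP processes, each of which can drive its own GPU. The sketch below shows one plausible way to split a FREQ1-style frequency list into contiguous chunks. The start frequency, the increment and the `partition` helper are hypothetical, and the actual MSC Nastran distribution scheme may differ.

```python
def freq1(f1, df, ndf):
    """Frequencies f1 + i*df for i = 0..ndf (ndf increments, ndf + 1 points)."""
    return [f1 + i * df for i in range(ndf + 1)]


def partition(items, nparts):
    """Split items into nparts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), nparts)
    chunks, start = [], 0
    for rank in range(nparts):
        size = base + (1 if rank < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical sweep: 100 increments of 2 Hz starting at 10 Hz, split for DMP=2.
freqs = freq1(10.0, 2.0, 100)
chunks = partition(freqs, 2)
for rank, chunk in enumerate(chunks):
    print(f"DMP rank {rank}: {len(chunk)} frequencies, {chunk[0]} to {chunk[-1]} Hz")
```

Because no data is exchanged between frequencies, doubling the number of process-plus-GPU pairs can approach a doubling of throughput, subject to load balance and shared I/O.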


MSC Nastran 2013: Solution Price-Performance Gain


NVH with MSC Nastran 2013
Trimmed Car Body Frequency Response with SOL108

USA Auto OEM: 1.2M nodes, 7.47M DOF
Shells (CQUAD4): 1.04M
Solids (CTETRA): 0.1M
100 frequency increments (FREQ1)
Server node: Sandy Bridge 2.6GHz, 2x 8 core, Tesla 2x K20X GPU, 128GB memory

[Figure: bar chart of elapsed time in hours (lower is better), vertical axis 0 to 80; bars are labeled with speedup over serial]

Configuration           Speedup
serial                  1X
smp 4c                  2.5X
smp 4c+1g (x1 node)     4.4X
dmp 4c+1g (x2 nodes)    6.8X
dmp 4c+1g (x3 nodes)    9X


NVH with MSC Nastran 2013
Engine Model Modal Frequency with SOL111

• Japan Auto OEM
– Nodes 1.4M, Elements 0.78M
• Mainly TETRA10
– Modes: 104 (2500 Hz)
– Front size: 23,718

[Figure: CPU time in seconds, 1CPU (9052 sec.) vs. 1CPU+1GPU (5116 sec.), stacked by FBS + matrix-vector multiply, shift + decomposition, and sparse decomposition only; vertical axis 0 to 10000; segment labels: 2848, 1000, 614, 586, 2807, 901, 2303, 2168]

[Figure: elapsed time in seconds, 1CPU (9702 sec.) vs. 1CPU+1GPU (5647 sec.), stacked by Pre_Eigenvalue, Eigenvalue, Resvec and Post_Eigenvalue; vertical axis 0 to 12000; segment labels (1CPU / 1CPU+1GPU): 335 / 239, 2856 / 1027, 6180 / 4120, 291 / 223]

1.7x speedup
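The quoted speedup follows directly from the totals in the chart legends. A minimal check, using only the elapsed and CPU times given for the 1CPU and 1CPU+1GPU runs:

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_seconds / accelerated_seconds


elapsed_speedup = speedup(9702, 5647)   # elapsed time, 1CPU vs. 1CPU+1GPU
cpu_speedup = speedup(9052, 5116)       # CPU time, 1CPU vs. 1CPU+1GPU
print(f"elapsed: {elapsed_speedup:.2f}x, CPU: {cpu_speedup:.2f}x")
```

Both ratios round to roughly 1.7x to 1.8x, in line with SOL111 being listed as a lower-impact solution sequence for GPU acceleration.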


Marc 2012

• Marc multi-frontal sparse solver is GPU accelerated
– Marc Solver type 8
• Multi-GPU support on both Linux and Windows
– Recommend 1 GPU per DDM


Marc 2012 – GPU Acceleration
Marc 2012 - Automotive Engine model (1M DOF)

Customer model
DOF: 1M
Elements: 170K
6.5X Speedup with 2 GPUs over Serial run

[Figure: bar chart comparing Serial, 1c + 1gpu, nps=2, nps=2 with 2 gpus, and nps=4 with 2 gpus; vertical axis 0 to 1800]


Marc 2012 – GPU Acceleration of US Auto OEM model

2.5 Million Elements
10 Million DOF
Nonlinear Bolt Tightening
48 Iterations

[Figure: bar chart of end-to-end speedup for Serial (1c), 4c, and 1c+1 GPU; vertical axis 0 to 3]


Conclusions

• GPUs provide significant acceleration for large, direct-solver-intensive jobs, i.e. max front > 10000 for real data and > 5000 for complex data models.
• Multiple-GPU performance is available with DMP > 1, including for NVH SOL108 (embarrassingly parallel).
• NVIDIA and MSC continue to work together to tune BLAS and LAPACK kernels for MSCLDL and MSCLU.
• As models become larger, the value of GPGPU becomes greater.



Thank You
