Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package MuQun Yang, Christian Chilan, Albert Cheng, Quincey Koziol, Mike Folk, Leon Arber The HDF Group Champaign, IL 61820

Jul 19, 2018


Parallel I/O Performance Study and Optimizations with HDF5,

A Scientific Data Package

MuQun Yang, Christian Chilan, Albert Cheng, Quincey Koziol, Mike Folk, Leon Arber

The HDF Group, Champaign, IL 61820

Outline
• Introduction to parallel I/O libraries
    • HDF5, netCDF, netCDF4
• Parallel HDF5 and Parallel netCDF performance comparison
• Parallel netCDF4 and Parallel netCDF performance comparison
• Collective I/O optimizations inside HDF5
• Conclusion

I/O process for Parallel Application

[Diagram: P0, P1, P2, P3 → one I/O library → File System]
• P0 becomes bottleneck
• May exceed system memory

[Diagram: P0, P1, P2, P3 → one I/O library per process → File System]
• May achieve good performance
• Needs post-processing(?)
• More work for applications

I/O process for Parallel Application

[Diagram: P0, P1, P2, P3 → Parallel I/O libraries → Parallel File System]

HDF5 vs. netCDF

• HDF5 and netCDF provide data formats and programming interfaces
• HDF5
    • Hierarchical file structures
    • Flexible data model
    • Many features
        • In-memory compression filters
        • Chunked storage
        • Parallel I/O through MPI-IO
• NetCDF
    • Linear data layout
    • A parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provides support for parallel access on top of MPI-IO

Overview of NetCDF4

Advantages:
• New features provided by HDF5:
    • More than one unlimited dimension
    • Various compression filters
    • Complex datatypes such as struct or array datatypes
    • Parallel I/O through MPI-IO
• NetCDF user-friendly APIs
• Long-term maintenance and distribution
• Potentially larger user community

Disadvantage:
• The HDF5 library must be installed

An example for collective I/O

• Every processor has a noncontiguous selection.
• Access requests are interleaved.
• Write operation with 32 processors; each processor's selection has 512K rows and 8 columns (32 MB/proc.)
    • Independent I/O: 1,659.48 s
    • Collective I/O: 4.33 s

[Diagram: row-major data layout; in each row (Row 1, Row 2, …) the selections of P0, P1, P2, P3 are interleaved]
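The numbers on this slide can be checked with a small sketch. It assumes 8-byte elements (the value that reproduces 32 MB per process) and a 512K x 256 row-major array in which each of the 32 processes owns 8 consecutive columns of every row. Neither assumption is stated on the slide, and the function names are illustrative only.

```python
# Layout from the slide: 32 processes, each selecting 512K rows x 8 columns.
NUM_PROCS = 32
ROWS = 512 * 1024
COLS_PER_PROC = 8
ELEMENT_SIZE = 8  # bytes; ASSUMED (e.g. double precision), not stated on the slide


def bytes_per_proc():
    """Total bytes written by one process."""
    return ROWS * COLS_PER_PROC * ELEMENT_SIZE


def run_offset(rank, row):
    """Byte offset of the contiguous run that `rank` owns in `row`.

    ASSUMED layout: a row holds the 8-column pieces of ranks 0..31 in order.
    """
    row_bytes = NUM_PROCS * COLS_PER_PROC * ELEMENT_SIZE
    return row * row_bytes + rank * COLS_PER_PROC * ELEMENT_SIZE


def runs_per_proc():
    """Every row contributes one small contiguous run per process."""
    return ROWS


if __name__ == "__main__":
    print(bytes_per_proc() // 2**20, "MiB per process")
    print(COLS_PER_PROC * ELEMENT_SIZE, "bytes per contiguous run")
    print(runs_per_proc(), "separate runs per process")
    print(round(1659.48 / 4.33), "x faster with collective I/O")
```

Under these assumptions each process issues 524,288 separate 64-byte accesses when writing independently, which is consistent with the large gap between the two timings on the slide.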


Parallel HDF5 and PnetCDF performance comparison

• Previous study: PnetCDF claims higher performance than HDF5
• NCAR Bluesky: Power4
• LLNL uP: Power5
• PnetCDF 1.0.1 vs. HDF5 1.6.5

HDF5 and PnetCDF performance comparison

• Benchmark is the I/O kernel of FLASH.
• FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
• Each processor handles 80 blocks and writes them into 3 output files.
• The performance metric given by FLASH I/O is the parallel execution time.
• The more processors, the larger the problem size.
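Because every processor handles a fixed number of blocks, the benchmark scales the problem with the processor count. A minimal sketch of that arithmetic follows. It counts grid cells only, since the slide does not give the number of variables per cell or the bytes per value; the function names are illustrative.

```python
BLOCKS_PER_PROC = 80  # from the slide


def cells_per_proc(block_edge):
    """Grid cells handled by one processor for cubic blocks of the given edge."""
    return BLOCKS_PER_PROC * block_edge ** 3


def total_cells(num_procs, block_edge):
    """Total cells grow linearly with the number of processors."""
    return num_procs * cells_per_proc(block_edge)


if __name__ == "__main__":
    print("Bluesky (8x8x8):", cells_per_proc(8), "cells per processor")
    print("uP (16x16x16):", cells_per_proc(16), "cells per processor")
    print("uP, 64 processors:", total_cells(64, 16), "cells in total")
```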

Previous HDF5 and PnetCDF Performance Comparison at ASCI White

(From Flash I/O website)

HDF5 and PnetCDF performance comparison

[Figure: Flash I/O Benchmark (Checkpoint files), uP: Power 5. x-axis: Number of Processors (10 to 160); y-axis: MB/s (0 to 60); series: PnetCDF, HDF5 independent]

[Figure: Flash I/O Benchmark (Checkpoint files), Bluesky: Power 4. x-axis: Number of Processors (10 to 310); y-axis: MB/s (0 to 2500); series: PnetCDF, HDF5 independent]

HDF5 and PnetCDF performance comparison

[Figure: Flash I/O Benchmark (Checkpoint files), Bluesky: Power 4. x-axis: Number of Processors (10 to 310); y-axis: MB/s (0 to 2500); series: PnetCDF, HDF5 collective, HDF5 independent]

[Figure: Flash I/O Benchmark (Checkpoint files), uP: Power 5. x-axis: Number of Processors (10 to 160); y-axis: MB/s (0 to 60); series: PnetCDF, HDF5 collective, HDF5 independent]

ROMS

• Regional Oceanographic Modeling System
• Supports MPI and OpenMP
• I/O in NetCDF
• History file writer in parallel
• Data: 60 1D-4D double-precision float and integer arrays

PnetCDF4 and PnetCDF performance comparison

[Figure: x-axis: Number of processors (0 to 144); y-axis: Bandwidth (MB/s), 0 to 160; series: PNetCDF collective, NetCDF4 collective]

• Fixed problem size = 995 MB
• Performance of PnetCDF4 is close to PnetCDF

ROMS Output with Parallel NetCDF4

[Figure: x-axis: Number of Processors (0 to 144); y-axis: Bandwidth (MB/s), 0 to 300; series: Output size 995 MB, Output size 15.5 GB]

• I/O performance improves as the file size increases.
• It can provide decent I/O performance for large problem sizes.


Improvements of collective IO support inside HDF5

• Advanced HDF5 feature: non-regular selections
• Performance optimizations: chunked storage
    • Provide several IO options to achieve good collective IO performance
    • Provide APIs for applications to participate in the optimization process

Improvement 1: HDF5 non-regular selections

[Figure: 2-D array with the IO in shaded selections]

• Only one HDF5 IO call
• Good for collective IO
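A rough illustration of why a non-regular selection can be served by one call: the union of several rectangular blocks can be flattened into a single ordered list of contiguous runs in the row-major file. The sketch below is plain Python and does not call HDF5; the function name and the example blocks are hypothetical. In the HDF5 C API such a union is typically built with H5Sselect_hyperslab and the H5S_SELECT_OR operator.

```python
def selection_runs(blocks, ncols):
    """Flatten a union of rectangles into contiguous runs of a row-major array.

    blocks: list of (row0, col0, nrows, ncols) rectangles, which may overlap.
    ncols:  number of columns of the full 2-D array.
    Returns a sorted list of (offset, length) runs, both counted in elements.
    """
    selected = set()
    for row0, col0, nrows, width in blocks:
        for r in range(row0, row0 + nrows):
            for c in range(col0, col0 + width):
                selected.add(r * ncols + c)

    runs = []
    for idx in sorted(selected):
        if runs and runs[-1][0] + runs[-1][1] == idx:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([idx, 1])     # start a new run
    return [tuple(run) for run in runs]


if __name__ == "__main__":
    # Two overlapping blocks in a 6x6 array form one non-regular selection.
    print(selection_runs([(0, 0, 2, 2), (1, 1, 3, 3)], ncols=6))
```

The whole selection is described by one list, so it can be handed to the I/O layer in one request instead of one request per rectangle.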

HDF5 chunked storage

• Required for extendable data variables
• Required for filters
• Better subsetting access time

For more information about chunking: http://hdf.ncsa.uiuc.edu/UG41r3_html/Perform.fm2.html#149138

Performance issue:
• Severe performance penalties with many small chunk IOs
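To see how quickly the number of chunk IOs grows, the following sketch counts the chunks that a rectangular selection overlaps. It is a simplified model, not HDF5 code; the function name and the sizes in the example are assumptions chosen for illustration.

```python
from itertools import product


def chunks_touched(start, count, chunk_dims):
    """Return the indices of all chunks overlapped by a rectangular selection.

    start, count: first element and extent of the selection in each dimension.
    chunk_dims:   chunk size in each dimension.
    """
    per_dim = []
    for s, n, c in zip(start, count, chunk_dims):
        first = s // c
        last = (s + n - 1) // c
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


if __name__ == "__main__":
    # The same 100x100 selection with small and with large chunks.
    print(len(chunks_touched((0, 0), (100, 100), (10, 10))), "chunk IOs")
    print(len(chunks_touched((0, 0), (100, 100), (50, 100))), "chunk IOs")
```

If each touched chunk is written with its own IO operation, the small-chunk layout in this example needs fifty times as many operations as the large-chunk layout for the same data.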

Improvement 2: One linked chunk IO

[Diagram: chunk 1, chunk 2, chunk 3, chunk 4; selections of P0 and P1; MPI-IO collective view]

• One MPI Collective IO call
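The idea of linking chunks can be sketched as follows: the pieces a process selects in every chunk are translated to file offsets and merged into one address-ordered request, which is roughly what a single MPI-IO file view would have to describe. This is a simplified model under assumed chunk addresses and piece sizes, not the HDF5 implementation; all names are hypothetical.

```python
def linked_chunk_request(chunk_addresses, pieces_per_chunk):
    """Combine one process's per-chunk pieces into a single request.

    chunk_addresses:  {chunk_id: byte address of the chunk in the file}
    pieces_per_chunk: {chunk_id: [(offset_in_chunk, length), ...]}
    Returns a list of (file_offset, length) sorted by file offset.
    """
    request = []
    for chunk_id, pieces in pieces_per_chunk.items():
        base = chunk_addresses[chunk_id]
        for offset, length in pieces:
            request.append((base + offset, length))
    request.sort()  # chunks are not necessarily stored in index order
    return request


if __name__ == "__main__":
    # ASSUMED addresses: four 4 KiB chunks stored out of order in the file.
    addresses = {1: 4096, 2: 0, 3: 12288, 4: 8192}
    p0_pieces = {1: [(0, 512)], 2: [(0, 512)], 3: [(0, 512)], 4: [(0, 512)]}
    print(linked_chunk_request(addresses, p0_pieces))
```

With one combined request per process, all chunks can be covered by a single collective call rather than one call per chunk.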

Improvement 3: Multi-chunk IO Optimization

Have to keep the option to do collective IO per chunk
• Collective IO bugs inside different MPI-IO packages
• Limitation of system memory

Problem
• Bad performance caused by improper use of collective IO

[Diagram: layout of the data accessed by processes P0 to P7]

• Just use independent IO
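A per-chunk decision of this kind might look like the sketch below. The 50% threshold, the function name, and the access pattern in the example (every chunk touched by a single process) are assumptions for illustration; the slide does not state the criterion HDF5 uses.

```python
def io_mode_per_chunk(chunk_to_procs, num_procs, min_percent=50):
    """Pick collective or independent IO for each chunk.

    chunk_to_procs: {chunk_id: iterable of ranks that access the chunk}
    min_percent:    ASSUMED threshold; a chunk is written collectively only
                    if at least this percentage of all processes access it.
    """
    modes = {}
    for chunk_id, procs in chunk_to_procs.items():
        percent = 100 * len(set(procs)) / num_procs
        modes[chunk_id] = "collective" if percent >= min_percent else "independent"
    return modes


if __name__ == "__main__":
    # Eight processes, eight chunks, each chunk touched by exactly one process.
    one_owner_each = {chunk: [chunk] for chunk in range(8)}
    print(io_mode_per_chunk(one_owner_each, num_procs=8))
```

In the example every chunk comes out as "independent", since a collective call per chunk would make seven processes take part in an operation in which they have nothing to write.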

Improvement 4

Problem
• HDF5 may not have enough information to make the correct decision about the way to do collective IO

Solution
• Provide APIs for applications to participate in the decision-making process

Flow chart of Collective Chunking IO improvements inside HDF5

Collective chunk mode
→ Decision-making about one linked-chunk IO (optional user input)
    Yes → One collective IO call for all chunks
    No  → Multi-chunk IO: decision-making about collective IO (optional user input)
        Yes → Collective IO per chunk
        No  → Independent IO
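The flow chart can be read as two nested decisions, each of which the application may override. The sketch below follows that reading. The decision criteria and default thresholds are assumptions, not values taken from the slides, and all Python names are hypothetical. In the HDF5 C library, data transfer properties such as H5Pset_dxpl_mpio_chunk_opt, H5Pset_dxpl_mpio_chunk_opt_num, and H5Pset_dxpl_mpio_chunk_opt_ratio appear to provide this kind of user input.

```python
def plan_collective_chunk_io(avg_chunks_per_proc, percent_procs_per_chunk,
                             user_linked=None, user_collective=None,
                             linked_min_chunks=30, collective_min_percent=60):
    """Walk the flow chart and return the chosen IO plan.

    avg_chunks_per_proc:     average number of chunks selected by a process.
    percent_procs_per_chunk: {chunk_id: percent of processes accessing it}.
    user_linked / user_collective: optional user input (True/False) that
        replaces the library's own decision at each step.
    linked_min_chunks / collective_min_percent: ASSUMED thresholds.
    """
    # Decision 1: one linked-chunk IO for everything?
    if user_linked is not None:
        use_linked = user_linked
    else:
        use_linked = avg_chunks_per_proc >= linked_min_chunks
    if use_linked:
        return {"strategy": "one collective IO call for all chunks"}

    # Decision 2 (multi-chunk IO): collective or independent IO per chunk?
    per_chunk = {}
    for chunk_id, percent in percent_procs_per_chunk.items():
        if user_collective is not None:
            use_collective = user_collective
        else:
            use_collective = percent >= collective_min_percent
        per_chunk[chunk_id] = ("collective IO per chunk" if use_collective
                               else "independent IO")
    return {"strategy": "multi-chunk IO", "per_chunk": per_chunk}


if __name__ == "__main__":
    print(plan_collective_chunk_io(50, {}))
    print(plan_collective_chunk_io(2, {"c1": 100, "c2": 12.5}))
    print(plan_collective_chunk_io(50, {"c1": 100},
                                   user_linked=False, user_collective=False))
```

The last call shows the user input taking precedence: the library-side criteria would pick one linked-chunk IO, but the application forces multi-chunk IO with independent writes.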

For details about the performance study and optimizations inside HDF5:

http://hdf.ncsa.uiuc.edu/HDF5/papers/papers/ParallelIO/ParallelPerformance.pdf

http://hdf.ncsa.uiuc.edu/HDF5/papers/papers/ParallelIO/HDF5-CollectiveChunkIO.pdf

Conclusions

• HDF5 provides collective IO support for non-regular selections.
• Supporting collective IO for chunked storage is not trivial. Users can participate in the decision-making process that selects among the different IO options.
• I/O performance is quite comparable when the parallel NetCDF and parallel HDF5 libraries are used in similar manners.
• I/O performance of parallel NetCDF4 is comparable to that of parallel NetCDF, about 15% slower on average for the output of the ROMS history file. We suspect that the slowdown is due to the software management involved in passing information from parallel NetCDF4 to HDF5.

Acknowledgments

This work is funded by National Science Foundation Teragrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.