On evaluating GPFS
Research work done at HLRS by Alejandro Calderon
On evaluating GPFS
Short description
Metadata evaluation
fdtree
Bandwidth evaluation
Bonnie
Iozone
IODD
IOP
GPFS description (http://www.ncsa.uiuc.edu/UserInfo/Data/filesystems/index.html)
General Parallel File System (GPFS) is a parallel file system package developed by IBM.
History:
Originally developed for IBM's AIX operating system, then ported to Linux systems.
Features:
Appears to work just like a traditional UNIX file system from the user application level.
Provides additional functionality and enhanced performance when accessed via parallel interfaces such as MPI-I/O.
High performance is obtained by GPFS by striping data across multiple nodes and disks.
Striping is performed automatically at the block level; therefore, all files larger than the designated block size will be striped (a simplified sketch of round-robin striping is shown below).
Can be deployed in NSD or SAN configurations.
Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.
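As an illustration of the block-level striping idea, here is a simplified round-robin model (illustration only; GPFS's real block allocation is more sophisticated, and the 1 MB block size and 4 NSD servers are made-up parameters):

/*
 * Simplified model of block-level striping: a file larger than the block
 * size is split into blocks that are spread round-robin over the NSDs.
 */
#include <stdio.h>

int main(void)
{
    const long block_size = 1024 * 1024;   /* assumed 1 MB GPFS block size   */
    const int  num_nsds   = 4;             /* assumed number of NSD servers  */
    const long file_size  = 10 * block_size;

    for (long offset = 0; offset < file_size; offset += block_size) {
        long block = offset / block_size;
        printf("block %2ld (offset %9ld) -> NSD %d\n",
               block, offset, (int)(block % num_nsds));
    }
    return 0;
}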
GPFS (Simple NSD Configuration)
GPFS evaluation (metadata)
fdtree: used for testing the metadata performance of a file system
Creates several directories and files, in several levels (see the sketch after this list)
Used on:
Computers:
noco-xyz
Storage systems:
Local, GPFS
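As a rough illustration of what such a metadata test does (a made-up sketch, not the fdtree code; the depth and files-per-directory mirror the -d 5 -f 3 options used in these slides, and the /gpfs path is an assumption):

/*
 * Create one directory per level, fill it with small empty files,
 * and report how long the metadata operations took.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <time.h>

#define LEVELS 5        /* like -d 5 */
#define FILES  3        /* like -f 3 */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char dir[4096] = "/gpfs/fdtree_test", file[4352];
    int created = 0;

    double t0 = now();
    for (int level = 0; level < LEVELS; level++) {
        mkdir(dir, 0755);                          /* directory create   */
        for (int f = 0; f < FILES; f++) {          /* small-file creates */
            snprintf(file, sizeof(file), "%s/file_%d", dir, f);
            int fd = open(file, O_CREAT | O_WRONLY, 0644);
            close(fd);
            created++;
        }
        /* descend one level for the next iteration */
        snprintf(dir + strlen(dir), sizeof(dir) - strlen(dir), "/L%d", level);
    }
    double t1 = now();
    printf("%d files + %d dirs created in %.3f s\n", created, LEVELS, t1 - t0);
    return 0;
}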
fdtree [local, NFS, GPFS]
Command: ./fdtree.bash -f 3 -d 5 -o X, where X is /gpfs, /tmp, or /mscratch
[Chart: operations/sec (0-2500) for directory creates, file creates, file removals, and directory removals per second on each file system]
fdtree on GPFS (Scenario 1): ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
Scenario 1:
several nodes, several processes per node,
different subtrees,
many small files
[Diagram: processes P1..Pm running on each node]
fdtree on GPFS (Scenario 1): ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
[Chart: operations/sec (0-600) for directory creates, file creates, file removals, and directory removals per second, for configurations 1n-1p, 4n-4p, 4n-8p, 4n-16p, 8n-8p, and 8n-16p (nodes-processes)]
fdtree on GPFS (Scenario 2): ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
Scenario 2:
several nodes, one process per node,
same subtree,
many small files
[Diagram: one process (P1..Px) per node]
fdtree on GPFS (Scenario 2): ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
[Chart: file creates per second (0-45) vs. number of processes (1, 2, 4, 8; one per node), comparing processes working in the same directory vs. in different directories]
Metadata cache on GPFS client
Working in a GPFS directory with 894 entries
ls -als needs to get each file's attributes from the GPFS metadata server
Within a couple of seconds, the contents of the cache seem to disappear

hpc13782 noco186.nec 304$ time ls -als | wc -l
894
real 0m0.466s  user 0m0.010s  sys 0m0.052s

hpc13782 noco186.nec 305$ time ls -als | wc -l
894
real 0m0.222s  user 0m0.011s  sys 0m0.064s

hpc13782 noco186.nec 306$ time ls -als | wc -l
894
real 0m0.033s  user 0m0.009s  sys 0m0.025s

hpc13782 noco186.nec 307$ time ls -als | wc -l
894
real 0m0.034s  user 0m0.010s  sys 0m0.024s
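A minimal sketch of how this effect could be measured programmatically (an assumption, not part of the original experiment; the directory path is made up):

/*
 * stat() every entry of a directory twice and compare the elapsed times;
 * the second pass should benefit from the client-side metadata cache.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static double scan(const char *dir)
{
    struct timespec a, b;
    char path[4096];
    struct stat st;

    clock_gettime(CLOCK_MONOTONIC, &a);
    DIR *d = opendir(dir);
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {          /* like 'ls -als': one stat per entry */
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
        stat(path, &st);
    }
    closedir(d);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const char *dir = "/gpfs/some_dir";         /* assumed directory with many entries        */
    printf("first pass : %.3f s\n", scan(dir)); /* attributes fetched from the metadata server */
    printf("second pass: %.3f s\n", scan(dir)); /* attributes served from the client cache     */
    return 0;
}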
fdtree results
Main conclusions
Contention at directory level:
If two or more processes from a parallel application need to write data, make sure each one uses a different subdirectory of the GPFS workspace (see the sketch below)
Better results than NFS (but lower than the local file system)
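A minimal sketch of that recommendation, assuming an MPI application and a /gpfs/workspace path (both illustrative):

/*
 * Each MPI rank creates and works inside its own subdirectory of the GPFS
 * workspace instead of sharing one directory, avoiding contention at the
 * directory level.
 */
#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one private subdirectory per process */
    char dir[256], file[512];
    snprintf(dir, sizeof(dir), "/gpfs/workspace/rank_%d", rank);
    mkdir(dir, 0755);

    /* all of this rank's output goes below its own subdirectory */
    snprintf(file, sizeof(file), "%s/output.dat", dir);
    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    write(fd, "data\n", 5);
    close(fd);

    MPI_Finalize();
    return 0;
}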
GPFS performance (bandwidth)
Bonnie
Read and write a 2 GB file
Write, rewrite and read
Used on:
Computers:
Cacau1
Noco075
Storage systems:
GPFS
Bonnie on GPFS [write + re-write]
Bandwidth (MB/sec.):
            cacau1-GPFS    noco075-GPFS
  write     51.86          164.69
  rewrite   3.43           36.35
(cacau1 accesses GPFS over NFS)
Bonnie on GPFS [read]
Bandwidth (MB/sec.):
            cacau1-GPFS    noco075-GPFS
  read      75.85          232.38
(cacau1 accesses GPFS over NFS)
GPFS performance (bandwidth)
Iozone: write and read with several file sizes and access sizes (see the sketch after this list)
Write and read bandwidth
Used on:
Computers:
Noco075
Storage systems: GPFS
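As a rough sketch of the kind of sweep Iozone performs (this is not the Iozone source; the sizes and the target path are assumptions):

/*
 * For each (file size, record size) pair, write the file in records of that
 * size and report the resulting bandwidth in MB/s.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long file_kb[]   = { 64, 1024, 16384 };        /* file sizes (KB)   */
    const long record_kb[] = { 4, 64, 1024 };             /* record sizes (KB) */

    for (size_t i = 0; i < sizeof(file_kb) / sizeof(*file_kb); i++) {
        for (size_t j = 0; j < sizeof(record_kb) / sizeof(*record_kb); j++) {
            long fsize = file_kb[i] * 1024, rsize = record_kb[j] * 1024;
            char *buf = calloc(1, rsize);

            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            int fd = open("/gpfs/iozone_like.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            for (long done = 0; done < fsize; done += rsize)
                write(fd, buf, rsize);
            fsync(fd);                                    /* flush to storage */
            close(fd);
            clock_gettime(CLOCK_MONOTONIC, &b);

            double sec = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
            printf("file %6ld KB, record %5ld KB: %8.2f MB/s\n",
                   file_kb[i], record_kb[j], (fsize / (1024.0 * 1024.0)) / sec);
            free(buf);
        }
    }
    return 0;
}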
Iozone on GPFS [write]
[3D chart "Write on GPFS": bandwidth (MB/s, 0-1200) vs. file size (64 KB to 524288 KB) and record length (4 to 16384 bytes)]
Iozone on GPFS [read]
[3D chart "Read on GPFS": bandwidth (MB/s, 0-2500) vs. file size (64 KB to 524288 KB) and record length (4 to 16384 bytes)]
GPFS evaluation (bandwidth)
IODD: evaluation of disk performance using several nodes (disk and networking)
A dd-like command that can be run from MPI
Used on: 2 and 4 nodes,
with 4, 8, 16, and 32 processes (1, 2, 3, and 4 per node) that write a file of 1, 2, 4, 8, 16, and 32 GB
Using both the POSIX interface and the MPI-IO interface (see the sketch below)
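A minimal sketch of such a dd-like MPI write test under both interfaces (illustration only, not the original IODD code; the paths and sizes are assumptions):

/*
 * Every process writes the same amount of data to /gpfs, first through the
 * POSIX interface and then through MPI-IO, and rank 0 reports the elapsed
 * time of each run.
 */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK  (1024 * 1024)          /* 1 MB per write   */
#define NBLOCK (1024)                 /* 1 GB per process */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = calloc(1, BLOCK);
    char path[64];
    double t0, t_posix, t_mpiio;

    /* --- POSIX interface: plain open/write/close, one file per process --- */
    snprintf(path, sizeof(path), "/gpfs/iodd_posix_%d.dat", rank);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    for (int i = 0; i < NBLOCK; i++)
        write(fd, buf, BLOCK);
    close(fd);
    MPI_Barrier(MPI_COMM_WORLD);
    t_posix = MPI_Wtime() - t0;

    /* --- MPI-IO interface: same data, written through MPI_File calls --- */
    snprintf(path, sizeof(path), "/gpfs/iodd_mpiio_%d.dat", rank);
    MPI_File fh;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    for (int i = 0; i < NBLOCK; i++)
        MPI_File_write(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Barrier(MPI_COMM_WORLD);
    t_mpiio = MPI_Wtime() - t0;

    if (rank == 0)
        printf("POSIX: %.2f s, MPI-IO: %.2f s\n", t_posix, t_mpiio);

    free(buf);
    MPI_Finalize();
    return 0;
}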
IODD on 2 nodes [MPI-IO]
[3D chart "GPFS (writing, 2 nodes)": bandwidth (MB/sec., 0-180) vs. processes per node and file size (GB)]
IODD on 4 nodes [MPI-IO]
[3D chart "GPFS (writing, 4 nodes)": bandwidth (MB/sec., 0-180) vs. processes per node and file size (GB)]
Differences by using different APIs
[Two 3D charts for GPFS writing on 2 nodes, one per API: bandwidth (MB/sec.) vs. processes per node and file size (GB); the MPI-IO chart scale reaches 180 MB/sec., the POSIX chart scale reaches 70 MB/sec.]
IODD on 2 GB [MPI-IO, same directory]
[Chart "GPFS (writing, 1-32 nodes, same directory)": bandwidth (MB/sec., 0-160) vs. number of nodes (1, 2, 4, 8, 16, 32)]
IODD on 2 GB [MPI-IO, different directories]
[Chart "GPFS (writing, 1-32 nodes, different directories)": bandwidth (MB/sec., 0-160) vs. number of nodes (1, 2, 4, 8, 16, 32)]
IODD results
Main conclusions
The bandwidth decreases with the number of processes per node
Beware of multithreaded applications with medium-to-high I/O bandwidth requirements for each thread
It is very important to use MPI-IO, because this API lets users get more bandwidth
The bandwidth also decreases with more than 4 nodes
With large files, the metadata management seems not to be the main bottleneck
GPFS evaluation (bandwidth)
IOP: gets the bandwidth obtained by writing and reading in parallel from several processes
The file size is divided by the number of processes, so each process works on an independent part of the file
Used on: GPFS through MPI-IO (ROMIO on Open MPI)
Two nodes writing a 2 GB file in parallel:
on independent files (non-shared)
on the same file (shared)
How IOP works
2 nodes, m = 2 processes (1 per node), n = 2 GB file size
File per process (non-shared): each process P1..Pm writes its own independent file
Segmented access (shared): all processes P1..Pm write to a single shared file, each one on an independent part of it
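A minimal MPI-IO sketch of the two access modes (illustration only, not the original IOP code; the file names, sizes, access size, and the SHARED toggle are assumptions):

/*
 * Each process writes n/m bytes, either to its own file (non-shared) or
 * into its own segment of one shared file (shared).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SHARED 1    /* 1: one shared file (segmented), 0: one file per process */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset file_size = 2LL * 1024 * 1024 * 1024;   /* n = 2 GB       */
    const MPI_Offset segment   = file_size / nprocs;          /* n / m per rank */
    const int        access    = 128 * 1024;                  /* access size    */

    char *buf = malloc(access);
    memset(buf, 'a' + rank, access);

    char name[64];
    if (SHARED)
        snprintf(name, sizeof(name), "/gpfs/iop_shared.dat");
    else
        snprintf(name, sizeof(name), "/gpfs/iop_rank%d.dat", rank);

    /* shared file: open collectively; file per process: open privately */
    MPI_Comm io_comm = SHARED ? MPI_COMM_WORLD : MPI_COMM_SELF;
    MPI_File fh;
    MPI_File_open(io_comm, name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* shared: write only inside this rank's segment; non-shared: start at 0 */
    MPI_Offset base = SHARED ? rank * segment : 0;
    double t0 = MPI_Wtime();
    for (MPI_Offset off = 0; off < segment; off += access)
        MPI_File_write_at(fh, base + off, buf, access, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth per process: %.2f MB/s\n",
               (segment / (1024.0 * 1024.0)) / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}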
IOP: Differences by using shared/non-shared
Writing on file(s) over GPFS
[Chart: bandwidth (MB/sec., 0-180) vs. access size (1 KB to 1 MB), NON-shared vs. shared]
IOP: Differences by using shared/non-shared
Reading on file(s) over GPFS
[Chart: bandwidth (MB/sec., 0-200) vs. access size (1 KB to 1 MB), NON-shared vs. shared]
GPFS writing in shared file: the 128 KB magic number
[Chart: bandwidth (MB/sec., 0-140) vs. access size (1 KB to 1 MB) for write, read, Rread, and Bread]
IOP results
Main conclusions
If several processes try to write to the same file, even on independent areas, the performance decreases
With several independent files the results are similar across tests, but with a shared file they are more irregular
A magic number appears: 128 KB. It seems that at that point the internal algorithm changes and the bandwidth increases