Network File System (NFS) in High Performance Networks
Wittawat Tantisiriroj, Garth Gibson
PDLSVD08-02

Overview
• NFS over RDMA was recently released (February 2008).
• What is the value of RDMA to storage users?
• Competing networks:
  • General-purpose network (e.g., Ethernet)
  • High-performance network with RDMA (e.g., InfiniBand)

NFS over IPoIB / NFS over RDMA
• IPoIB: implemented as a standard network driver, so NFS runs unmodified over TCP RPC.
• RDMA: implemented as a new RPC transport type (RDMA RPC over RDMA verbs); see the mount sketch below.

[Figure: NFS protocol stacks. NFSv2/v3/v4/v4.1 sit on ONC RPC (Open Network Computing Remote Procedure Call). Three transport paths: TCP RPC → TCP/IP → Ethernet card driver → Ethernet card; TCP RPC → TCP/IP → IPoIB driver → InfiniBand card; RDMA RPC → RDMA verbs with RDMA CM (IB CM or iWARP CM) → InfiniBand/iWARP card.]

Experiment Setup
• One NFS server and 1-4 NFS clients, connected by 1) a Gigabit Ethernet switch and 2) an InfiniBand SDR switch.
• Server storage: 1) ext2 on a single 160 GB SATA disk, 2) tmpfs on 4 GB of RAM.
• For point-to-point throughput, IP over InfiniBand (connected mode) is comparable to native InfiniBand; a measurement sketch appears below.

[Figure: Point-to-point throughput with iPerf and ib_send_bw versus bytes per send() call (2 B to 2 MB), y-axis 0-1000 MB/s. Series: InfiniBand (reliable connection) with ib_send_bw; Socket Direct Protocol (SDP), IP over InfiniBand (connected mode, with 1 and with 2 streams), IP over InfiniBand (datagram mode), and 1 Gigabit Ethernet, each with iPerf.]

Experimental Results
• When the disk is the bottleneck, NFS benefits from neither IPoIB nor RDMA.
• As the number of concurrent read operations increases, aggregate throughput improves significantly for both IPoIB and RDMA, with no disadvantage for IPoIB.
• When the disk is not the bottleneck, NFS benefits significantly from both IPoIB and RDMA.
• RDMA beats IPoIB by ~20% (479.25 vs. 397.76 MB/s for cached sequential reads).

Sequential Write to Server’s Disk – Read from Server’s Disk
5 GB file, 64 KB per operation (ext2 on a single 160 GB SATA disk); throughput in MB/s:

             write   read
local disk   61.54   53.02
eth          54.40   26.92
ipoib        45.96   25.20
rdma         36.76   25.33

Sequential Write to Server’s Disk – Read from Server’s Cache
512 MB file, 64 KB per operation (ext2 on a single 160 GB SATA disk); throughput in MB/s:

             write   read
local disk   60.26   2636.51
eth          59.52   107.93
ipoib        57.39   397.76
rdma         30.67   479.25

Reference lines: iperf (eth) ~110, iperf (ipoib) ~650, ib_send_bw ~940.

Parallel Sequential Read from Server’s Cache
512 MB file, 64 KB per operation (ext2 on a single 160 GB SATA disk); aggregate throughput in MB/s (see the threaded read sketch below):

                        eth   ipoib   rdma
1 client x 1 thread     108   405     456
1 client x 2 threads    112   429     600
1 client x 4 threads    112   453     583
2 clients x 1 thread    111   717     635
2 clients x 2 threads   112   719     699
4 clients x 1 thread    111   759     758

Reference lines: iperf (eth) ~110, iperf (ipoib) ~650, iperf (ipoib, 2 streams) ~980, ib_send_bw ~940.

Network Comparison
Type                  Bandwidth (Gbps)   ~Latency (μs)   ~Price per NIC+Port ($)
Gigabit Ethernet       1                 40                 40
10 Gigabit Ethernet   10                 40              1,350
InfiniBand 4X SDR      8                  4                600
InfiniBand 4X DDR     16                  4                720
InfiniBand 4X QDR     32                  4              1,200

Source: High-Performance Systems Integration group, Los Alamos National Laboratory (HPC-5, LANL)
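Mounting the Two Transports
The transport is chosen on the client at mount time. As a minimal sketch, assuming a Linux client with the NFS/RDMA transport module loaded and a server exporting a hypothetical /export reachable at its IPoIB address server-ib: the IPoIB path is an ordinary NFS-over-TCP mount against that address, e.g. "mount -o proto=tcp server-ib:/export /mnt/nfs", while the RDMA path selects the RDMA transport on its registered port, e.g. "mount -o proto=rdma,port=20049 server-ib:/export /mnt/nfs". The rest of the client's NFS stack is unchanged, which is what makes the two transports directly comparable.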
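Point-to-Point Measurement Notes
In the point-to-point figure, the native InfiniBand line comes from ib_send_bw (perftest suite) and the TCP lines from iPerf (iperf -s on one host, iperf -c <host> on the other); the IPoIB mode is switched through the interface's sysfs attribute, for example "echo connected > /sys/class/net/ib0/mode". The sweep over bytes handed to each send() call can be approximated with a short TCP client. The sketch below is hypothetical, not the tool behind the figure, and assumes any sink on the far end that accepts a connection and discards what it reads:

/* sendsweep.c: hypothetical sketch of the "bytes per send() call" sweep
 * behind the point-to-point figure.  Streams TCP data to <server-ip> <port>
 * for a few seconds at each send() size, doubling from 2 B to 2 MB, and
 * prints the achieved throughput.  Build: cc -O2 sendsweep.c -o sendsweep
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define DURATION 5.0                     /* seconds measured per size */

static double now(void)                  /* wall-clock time in seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    char *buf = calloc(1, 2 * 1024 * 1024);
    size_t sz;

    if (argc != 3 || !buf) {
        fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
        return 1;
    }
    for (sz = 2; sz <= 2 * 1024 * 1024; sz *= 2) {
        struct sockaddr_in sa = { 0 };
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        long long sent = 0;
        double t0;

        sa.sin_family = AF_INET;
        sa.sin_port = htons(atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &sa.sin_addr);
        if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
            perror("connect");
            return 1;
        }
        t0 = now();
        while (now() - t0 < DURATION) {  /* hand sz bytes to each send() */
            ssize_t n = send(fd, buf, sz, 0);
            if (n < 0) { perror("send"); return 1; }
            sent += n;
        }
        printf("%8zu B/send: %8.1f MB/s\n",
               sz, sent / (now() - t0) / (1024.0 * 1024.0));
        close(fd);
    }
    free(buf);
    return 0;
}

Per-call and per-packet overheads dominate at small message sizes, which is why every curve in the figure only approaches its plateau at multi-kilobyte sends.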
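Sequential and Parallel Read Sketch
All of the NFS results use the same access pattern: sequential reads or writes of 64 KB per operation, issued by one or more threads per client. The poster does not name its benchmark tool, so the reader below is a hypothetical stand-in (the file path, thread cap, and command-line interface are assumptions):

/* seqread.c: hypothetical stand-in for the poster's benchmark.  Each of
 * <nthreads> threads sequentially reads <file> in 64 KB chunks; the
 * program reports aggregate throughput.
 * Build: cc -O2 -pthread seqread.c -o seqread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (64 * 1024)                /* 64 KB per read(), as in the poster */
#define MAX_THREADS 64                   /* arbitrary cap, an assumption */

static const char *path;

/* Each thread opens its own descriptor and reads the whole file
 * sequentially, returning the number of bytes it moved. */
static void *reader(void *arg)
{
    long long *moved = malloc(sizeof *moved);
    char *buf = malloc(CHUNK);
    int fd = open(path, O_RDONLY);
    ssize_t n;

    (void)arg;
    if (!moved || !buf || fd < 0) { perror("reader setup"); exit(1); }
    *moved = 0;
    while ((n = read(fd, buf, CHUNK)) > 0)
        *moved += n;
    close(fd);
    free(buf);
    return moved;
}

int main(int argc, char **argv)
{
    pthread_t tid[MAX_THREADS];
    struct timeval t0, t1;
    long long total = 0;
    int i, nthreads;
    double secs;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <nthreads>\n", argv[0]);
        return 1;
    }
    path = argv[1];
    nthreads = atoi(argv[2]);
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return 1;

    gettimeofday(&t0, NULL);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, reader, NULL);
    for (i = 0; i < nthreads; i++) {
        void *ret;
        pthread_join(tid[i], &ret);
        total += *(long long *)ret;
        free(ret);
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d thread(s): %.2f MB/s aggregate\n",
           nthreads, total / secs / (1024.0 * 1024.0));
    return 0;
}

Run against the NFS mount with a 5 GB file, the server's disk stays the bottleneck (first table); with a freshly written 512 MB file, the data is served from the server's cache and the network is exposed (second and third tables). Client page caches should be dropped, or the file system remounted, between runs so reads actually cross the network.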