Page 1: Debugging Slow Buffered Reads to the Lustre File System

1 February 2016

Debugging Slow Buffered Reads to the Lustre Filesystem By Robert Roy, Senior Staff Engineer

Page 2: Debugging Slow Buffered Reads to the Lustre File System


Direct I/O reads outperform buffered I/O reads

The Problem

Seagate CS9000 with 4M RPCs

›  Buffered reads: ~3.5 GB/s per OST

›  o_direct reads: ~4.5 GB/s per OST

›  Buffered writes: ~4.5 GB/s per OST

More clients do not produce more bandwidth.

Server side? The data path on the server is the same for o_direct and buffered I/O, so a server-side bottleneck would slow both equally.

Client side? Buffered I/O goes through the page cache, which is populated by readahead.

Client-side readahead is the suspect.

Page 3: Debugging Slow Buffered Reads to the Lustre File System


Readahead requests never ramp up to 4M RPCs

The Root Cause

[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_1node_1thread.cap.gz -Tfields -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len -R lustre.niobuf_remote | head -10
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
172.19.62.138 172.19.55.5 1903 5242880 2097152
172.19.62.138 172.19.55.5 1903 7340032 1048576
172.19.62.138 172.19.55.5 1903 8388608 1048576
172.19.62.138 172.19.55.5 1903 9437184 2097152
172.19.62.138 172.19.55.5 1903 11534336 1048576
172.19.62.138 172.19.55.5 1903 12582912 1048576
...
172.19.62.138 172.19.55.5 1903 1685061632 1048576
172.19.62.138 172.19.55.5 1903 1686110208 1048576
172.19.62.138 172.19.55.5 1903 1687158784 1048576
172.19.62.138 172.19.55.5 1903 1688207360 1048576
172.19.62.138 172.19.55.5 1903 1689255936 1048576
172.19.62.138 172.19.55.5 1903 1690304512 1048576
172.19.62.138 172.19.55.5 1903 1691353088 1048576
172.19.62.138 172.19.55.5 1903 1692401664 1048576
172.19.62.138 172.19.55.5 1903 1693450240 1048576
172.19.62.138 172.19.55.5 1903 1694498816 1048576

Page 4: Debugging Slow Buffered Reads to the Lustre File System


Even with a large 64 MB I/O size, all I/O serviced from readahead is 1 MB in size

The Root Cause

[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_32node_4thread_64mIO.cap.gz -Tfields -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len -R lustre.niobuf_remote | grep 288 | head -n 20
172.19.62.138 172.19.55.4 2288 0 4194304
172.19.62.138 172.19.55.4 2288 4194304 4194304
172.19.62.138 172.19.55.4 2288 8388608 4194304
172.19.62.138 172.19.55.4 2288 12582912 4194304
172.19.62.138 172.19.55.4 2288 16777216 4194304
172.19.62.138 172.19.55.4 2288 20971520 4194304
172.19.62.138 172.19.55.4 2288 25165824 4194304
172.19.62.138 172.19.55.4 2288 29360128 4194304
172.19.62.138 172.19.55.4 2288 33554432 4194304
172.19.62.138 172.19.55.4 2288 37748736 4194304
172.19.62.138 172.19.55.4 2288 41943040 4194304
172.19.62.138 172.19.55.4 2288 46137344 4194304
172.19.62.138 172.19.55.4 2288 50331648 4194304
172.19.62.138 172.19.55.4 2288 54525952 4194304
172.19.62.138 172.19.55.4 2288 58720256 4194304
172.19.62.138 172.19.55.4 2288 62914560 4194304
172.19.62.138 172.19.55.4 2288 67108864 1048576
172.19.62.138 172.19.55.4 2288 68157440 1048576
172.19.62.138 172.19.55.4 2288 69206016 1048576
172.19.62.138 172.19.55.4 2288 70254592 1048576

Page 5: Debugging Slow Buffered Reads to the Lustre File System


The Source of the Problem

In lustre/llite/rw.c:

#define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)

And right above that line…

/* RAS_INCREASE_STEP should be (1UL << (inode->i_blkbits - PAGE_CACHE_SHIFT)).
 * Temporarily set RAS_INCREASE_STEP to 1MB. After 4MB RPC is enabled
 * by default, this should be adjusted corresponding with max_read_ahead_mb
 * and max_read_ahead_per_file_mb otherwise the readahead budget can be used
 * up quickly which will affect read performance significantly. See LU-2816 */

Page 6: Debugging Slow Buffered Reads to the Lustre File System


Set the increase step to the same value as the RPC size

The Solution

< #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
> #define RAS_INCREASE_STEP(inode) (PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)

Page 7: Debugging Slow Buffered Reads to the Lustre File System


Results

INCREASE_STEP  RA File  RA MB  Clients  PPN  IO Size  Read Average (MB/s)
1MB            40       40     32       1    1M       6928.02
4MB            40       40     32       1    1M       8629.80
1MB            160      640    32       1    1M       7137.50
4MB            160      640    32       1    1M       9528.45

IOR -r -v -F -b 131072m -t 1m -i 3 -m -k -D 60

Page 8: Debugging Slow Buffered Reads to the Lustre File System

February 2016

Conclusion

Page 9: Debugging Slow Buffered Reads to the Lustre File System


Conclusion and More Information

Buffered reads can be improved significantly when 4M RPCs are in use. Seagate implemented a tunable parameter to address the issue:

lctl set_param -n llite.*.read_ahead_step 4

https://github.com/Xyratex/lustre-stable/commit/2395f8e0e7e963aec43deb07d719e9229884758c

LU-7140 tracks the upstream work

https://jira.hpdd.intel.com/browse/LU-7140

Page 10: Debugging Slow Buffered Reads to the Lustre File System

Thank You

Page 11: Debugging Slow Buffered Reads to the Lustre File System

Questions?

Page 12: Debugging Slow Buffered Reads to the Lustre File System

February 2016

About Seagate

Page 13: Debugging Slow Buffered Reads to the Lustre File System


›  2+ million enclosures

›  17+ petabytes shipped

›  Drive Variety (HDD, SAS, SATA, SSD, hybrid)

›  Enclosures, controllers

›  Customer-driven partnership

›  Services: Logistics, fulfillment, warranty, design, supply chain

›  Purpose-engineered to optimize capacity and performance

›  40% fewer racks required

›  >1TB/sec file system performance

›  Solutions for object storage

›  Reference architectures for open source and software-defined storage

›  Private cloud appliances for backup and recovery

›  Modular, scalable components for DIY customers

Scale-Out Systems HPC OEM

Seagate Cloud Systems & Silicon Group

Page 14: Debugging Slow Buffered Reads to the Lustre File System


Powering the Fastest HPC Sites

Awards

Award-Winning ClusterStor Architecture