InfiniBand/RDMA for Storage - SRP vs. iSER

May 12, 2015

This is the talk given by Sebastian Parschauer (Riemer) at LinuxTag 2013. He is a Linux kernel developer in the storage team at ProfitBricks and develops storage solutions for the IaaS 2.0 cloud.
The last slide, on replication, in particular caused a lot of discussion.
Transcript
Page 1: InfiniBand/RDMA for Storage - SRP vs. iSER

InfiniBand/RDMA for Storage - SRP vs. iSER

Sebastian Riemer
Linux Kernel Developer - Storage

23.05.2013

Structure

● RDMA Basics
● RDMA Hardware
  ● InfiniBand, iWARP, RoCE
● RDMA Software + Network Protocols
● SRP vs. iSER

RDMA for Storage 2/28 23.05.2013

RDMA Basics

Remote Direct Memory Access (RDMA)

Latency

e.g. 4k sync. reads, status/information requests, ...

RDMA MTU

● RDMA MTU: 256, 512, 1024, 2048, 4096 Bytes
● Larger MTU: higher throughput, higher transfer latency
● Max. MTU is settable
● Active MTU is determined
● InfiniBand: RDMA MTU is native
● iWARP/RoCE: RDMA MTU must fit into the Ethernet MTU: 1500 → 1024 Bytes
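
The "must fit into Ethernet MTU" rule can be sketched as picking the largest of the five RDMA MTU sizes that is not bigger than the Ethernet MTU. This is a minimal illustration only: the function name rdma_mtu_for and its optional overhead argument are assumptions made here, and the slide gives no header overhead, so it defaults to 0.

```shell
#!/bin/sh
# Sketch: largest RDMA MTU that fits into a given Ethernet MTU.
# rdma_mtu_for ETH_MTU [OVERHEAD_BYTES]
# OVERHEAD_BYTES is a hypothetical knob for protocol headers (default 0).
rdma_mtu_for() {
    eth_mtu=$1
    overhead=${2:-0}
    best=0                      # 0 means: no RDMA MTU fits
    for mtu in 256 512 1024 2048 4096; do
        if [ "$mtu" -le $((eth_mtu - overhead)) ]; then
            best=$mtu
        fi
    done
    echo "$best"
}

rdma_mtu_for 1500   # standard Ethernet MTU, prints 1024
rdma_mtu_for 9000   # jumbo frames (assumed value), prints 4096
```

With the standard Ethernet MTU of 1500 this reproduces the 1024 Bytes from the slide.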

RDMA Hardware

InfiniBand (IB)

● Switched fabric interconnect
● Arbitrary topologies: Fat Tree, Mesh, Lash, ...
● Point-to-point bidirectional serial links
● Used in HPC and Enterprise Data Centers
● QDR 10 Gbit/s, FDR 14 Gbit/s per lane
● Lanes: 4
● Low end-to-end latency < 2 µs (1 GbE: 35 µs)
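
As a rough worked example, the per-lane rates and the lane count above multiply out to the link rate. The sketch below also applies the line encoding, which is background knowledge and not on the slide: QDR uses 8b/10b and FDR uses 64b/66b. The function name link_rate is made up for this illustration, and the FDR figure uses the rounded 14 Gbit/s from the slide.

```shell
#!/bin/sh
# Sketch: raw signalling rate and usable data rate of an IB link in Mbit/s.
# link_rate LANES LANE_MBIT ENC_NUM ENC_DEN
# ENC_NUM/ENC_DEN is the line encoding ratio (assumption, not from the slide).
link_rate() {
    lanes=$1
    lane_mbit=$2
    num=$3
    den=$4
    raw=$((lanes * lane_mbit))       # signalling rate over all lanes
    data=$((raw * num / den))        # integer arithmetic, rounds down
    echo "raw=${raw} data=${data}"
}

link_rate 4 10000 8 10     # 4x QDR, 8b/10b:  raw=40000 data=32000
link_rate 4 14000 64 66    # 4x FDR, 64b/66b: raw=56000 data=54303
```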

InfiniBand (IB)

● Subnet Manager (SM)
● LID (16 bit) and GID (128 bit) addressing
● GID = 64 bit subnet prefix + 64 bit GUID
● Max. 128 partitions (like VLANs)
● QoS, reliability and scalability
● Credit-based flow control → no packet loss
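
The GID layout above can be illustrated by concatenating the two 64 bit halves as hex strings. This is only a sketch: make_gid is a helper name invented here, fe80000000000000 is the default InfiniBand subnet prefix, and the GUID is an example port GUID.

```shell
#!/bin/sh
# Sketch: build a 128 bit GID (32 hex digits) from a 64 bit subnet prefix
# and a 64 bit GUID, both given as 16 hex digits without separators.
make_gid() {
    prefix=$1
    guid=$2
    for part in "$prefix" "$guid"; do
        case $part in
            *[!0-9a-fA-F]*)
                echo "not a hex string: $part" >&2
                return 1 ;;
        esac
        if [ "${#part}" -ne 16 ]; then
            echo "need exactly 16 hex digits: $part" >&2
            return 1
        fi
    done
    printf '%s%s\n' "$prefix" "$guid"
}

# Default subnet prefix + example port GUID
make_gid fe80000000000000 0002c903004ed0b3
```

The result has the same form as a dgid value passed to the SRP initiator.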

InfiniBand Congestion

● Congestion Control (CC) not ready yet
● CC = tell SM to tell others to reduce their speed
● Reduce MTU, set QoS, set IO limits, multipath

[Figure: a port that is BLOCKED with NO CREDITS tells the SM; a master SM and a slave SM are shown]

Host Channel Adapters (HCA)

● IB counterpart of NICs
● Communicate via a Queue Pair (QP) consisting of Send Queue (SQ) and Receive Queue (RQ)
● Reliable/Unreliable, Connected/Disconnected (Datagram)
● Support for atomic operations
● Error counters in HW

Host Channel Adapters (HCA)

● Mellanox QDR: ConnectX-2 VPI, driver: mlx4_ib
● QLogic/Intel QDR: 7300 Series, driver: qib

better for the DC/cloud

Internet Wide Area RDMA Protocol (iWARP)

● RDMA Network Interface Card (RNIC)
● Connection-oriented (TCP), only RDMA technology routable through the Internet
● Reliable Connected (RC) only
● Latency, bandwidth: >= 3 µs, usually 10 Gbit/s
● Vendors: Chelsio (driver cxgb3/4), Intel NetEffect (driver nes)

RDMA over Converged Ethernet (RoCE)

● Limited to a single Ethernet broadcast domain
● InfiniBand frame encapsulation (IBoE)
● GID is composed of MAC address + reserved
● Better suited under congestion
● Scaling issues in big data center setups
● Latency, bandwidth: < 2 µs, 10/40 Gbit/s
● Vendors: Mellanox (driver mlx4_en), Emulex (driver ocrdma)

RDMA Software + Network Protocols

OpenFabrics Enterprise Distribution (OFED)

● Approx. 30 SW packages
● Upstream version: 3.5
● IB Verbs: Hardware/OS abstraction layer
● One IB verbs user-space driver per RDMA HW
● IB Subnet Management (e.g. opensm)
● Communication Management (CM)
● Performance and diagnostic tools + utilities

RDMA Network Protocols

● IP over InfiniBand (IPoIB)
● iSCSI Extensions for RDMA (iSER)
● SCSI RDMA Protocol (SRP)
● Network File Systems (NFS-RDMA)
● Distributed File Systems (GlusterFS, Lustre)

SRP vs. iSER

iSCSI Extensions for RDMA (iSER)

Target

user space:
● STGT

kernel:
● Solaris COMSTAR
● (LIO isert, kernel 3.10)

● Mellanox pushes iSER and STGT
● No advanced features with STGT like live resizing
● ProfitBricks chose Solaris for ZFS and iSER
● LIO isert is too new

iSCSI Extensions for RDMA (iSER)

open-iscsi Initiator

user space:
● iscsid

kernel:
● ib_iser
● libiscsi
● scsi_transport_iscsi
● (ib_ipoib)

● Complexity
● Multiple maintainers
● Major IPoIB bugs
● IP-based DDoS reconnect
● Mellanox is mainly improving performance
● Too unstable for IB

SCSI RDMA Protocol (SRP)

Target

kernel:
● SCST ib_srpt
● Solaris COMSTAR
● (LIO ib_srpt)

● Very committed SCST maintainers Bart and Vlad (Bart Van Assche, Vladislav Bolkhovitin)
● ProfitBricks chose SCST due to ZFS and iSER issues
● LIO SRP unstable/unusable

SCSI RDMA Protocol (SRP)

Initiator

user space:
● (srp-tools)

kernel:
● ib_srp
● scsi_transport_srp

● Simplicity: RDMA-only, kernel-only possible
● Inactive maintainer
● No fast IO failing, no continuous reconnect
● Losing SCSI disks
● Bart + Mellanox are active
● Bart's work doesn't fit us

ProfitBricks Choices

● Simplicity = Stability → SRP without srp-tools
● Help improve SCST
● Improved the SRP initiator ourselves
  ● Just fast IO failing + automatic reconnect
  ● Never lose SCSI devices automatically
● Published SRP initiator fixes
● Implement RDMA in QEMU for performance

SRP Fixes

● From Bart: https://github.com/bvanassche/ib_srp-backport

● From ProfitBricks: https://github.com/sriemer/ib_srp

● Bart also has performance patches + backport
● Bart uses the srp-tools + losing SCSI devices
● Gradually finding compromises

Establish an SRP connection

# T... = target side, I... = initiator side
THCA_GUID="0002c903004ed0b2"                  # target HCA GUID
TGID_P1="fe800000000000000002c903004ed0b3"    # target GID of port 1
PKEY="ffff"                                   # partition key
IHCA="mlx4_0"                                 # initiator HCA
IHCA_P1="1"                                   # initiator HCA port
SRP="id_ext=${THCA_GUID},ioc_guid=${THCA_GUID},dgid=${TGID_P1},pkey=${PKEY},service_id=${THCA_GUID}"
echo "${SRP}" > /sys/class/infiniband_srp/srp-${IHCA}-${IHCA_P1}/add_target

InfiniBand/RDMA Links/Information

● InfiniBand Trade Association (IB specification, doc, www.infinibandta.com)
● OpenFabrics Alliance (OFA, OFED providers, www.openfabrics.org)
● Mellanox Technologies (www.mellanox.com)
● [email protected] mailing list
● LinkedIn group "InfiniBand Technologists"

Questions?

● Questions???

● [email protected]
● www.profitbricks.com

Bonus: How to do replication right?

[Figure: three replication setups, each attached via SRP/iSER/iSCSI. Other labels in the figure: IP links, cluster managers]

● Primary + Secondary: WRONG! Store & forward writes! Slow!
● Primary + Primary: WRONG! Complex, error-prone!
● LUN + LUN, e.g. SW RAID-1: RIGHT! Simple and fast!
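
One way to read the "RIGHT" variant is that the initiator sees two plain LUNs and mirrors them itself with software RAID-1. A minimal sketch using mdadm is shown below; it only prints the command. The device names /dev/md0, /dev/sdb and /dev/sdc are hypothetical placeholders, and mirror_cmd is a helper name invented here, not something from the talk.

```shell
#!/bin/sh
# Sketch: build the mdadm call that mirrors two LUNs on the initiator.
# mirror_cmd MD_DEVICE LUN_A LUN_B
# All device paths are hypothetical examples.
mirror_cmd() {
    md_dev=$1
    lun_a=$2
    lun_b=$3
    echo "mdadm --create $md_dev --level=1 --raid-devices=2 $lun_a $lun_b"
}

# Print the command only; creating the array overwrites data on both LUNs,
# so review it and run it manually as root.
mirror_cmd /dev/md0 /dev/sdb /dev/sdc
```

In this reading, writes go to both LUNs in parallel from the initiator, with no store-and-forward hop between storage servers and no cluster manager.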