OFED for Linux: Status and Next Steps
Betsy Zeller (QLogic), Tziporet Koren (Mellanox)
3/16/2010
www.openfabrics.org
Agenda
• Linux OFED components
• Releases completed in 2009:
  – OFED 1.4.1 & 1.4.2
  – OFED 1.5
• Releases planned for 2010:
  – OFED 1.5.1
  – OFED 1.5.2/3
• Future releases:
  – OFED 1.6 and beyond
• How can you contribute?
• Questions and discussion
  – Planning for scalability and discussion on MPI inclusion
Linux OFED Components
• HCA/NIC drivers
  – IB: IBM, Mellanox, QLogic
  – iWARP: Chelsio, Intel
  – RoCEE: Mellanox
• Core: Verbs, MAD, SMA, CM, CMA
• IPoIB
• SDP
• SRP and SRP Target
• iSER and iSER Target
• RDS
• NFS-RDMA
• Qlogic_VNIC
• uDAPL
• OSM
• Diagnostic tools
• Bonding module
• Open iSCSI
• MPI components: MVAPICH, Open MPI, MVAPICH2, benchmark tests
• Proprietary MPIs: Intel, Platform MPI
• Proprietary SMs: Sun, Voltaire, QLogic, Mellanox
Diagram legend: OFA Development, Add on, Tested with
OFED 1.4.1
• Released on June 3, 2009
• Distro integration:
  – Integrated into RHEL 5.4
  – SLES 11
• New features
  – Added support for RHEL 5.3 and SLES 11
  – NFS/RDMA: beta quality, with support for RHEL 5.2, 5.3 and SLES 10 SP2
  – Updated MPI packages: MVAPICH 1.1.0-3355, Open MPI 1.3.2
  – Updated bonding package: ib-bonding-0.9.0-40
  – Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
  – Updated OpenSM version to include critical bug fixes
  – Fixed RDS iWARP support
  – Low level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
  – Added a module parameter to control the number of MTTs per segment in Mellanox HCAs
  – mstflint update
  – Enhanced OpenSM and management tools: user interface, HA, routing enhancements, and more
• Used in Intel® Cluster Ready Solutions
OFED 1.4.2
• Released on Aug 6, 2009
  – Critical bug fixes, including:
    – Fixes to NES (Intel iWARP) driver
    – Fixes to support running with Lustre installed
    – NFS/RDMA critical bug fixes (still technology preview)
• Distro integration:
  – RHEL 5.5: OFED 1.4.2+
  – SLES 11 SP1: OFED 1.4.2 (kernel from 2.6.32)
• Passed Oracle 11g certification with RDS
OFED 1.5
• Released on December 30, 2009
  – Kernel base: 2.6.30
• To be included in RHEL 5.6
• Used in Intel® Cluster Ready Solutions
• List of supported kernels for OFED 1.5
  – RHEL4: up6, up7, up8
  – RHEL5: up2, up3, up4
  – SLES10: SP2, SP3
  – SLES 11
  – Fedora Core 11*
  – OpenSuSE 11*
  – Kernel.org: 2.6.18-2.6.32*
* Minimal QA for these versions.
OFED 1.5 – New features
• New OSes: RedHat EL5.4, EL4.8 and SLES 10 SP3
• Kernel base: 2.6.30
  – New supported kernels: 2.6.29-2.6.32
  – Forward port to support kernels above 2.6.30
• Hardware driver (qib) for new QLogic QDR HCA
• Added ConnectX2 support to mlx4 driver
• User space packages as tarballs for easier distro integration
  – All libraries available under: http://www.openfabrics.org/downloads/
• SDP:
  – Zero Copy (beta); more performance improvements
• Bonding: included only for older distros that do not include IB support
  – RHEL 5.3 / SLES 10 SP2 or older versions
1.5 – New features – Cont.
• uDAPL:
  – Scalability enhancements
  – New UCM provider
  – Common code base with WinOF 2.1
• MPI updates:
  – OSU MVAPICH 1.2.0
  – Open MPI 1.4
  – OSU MVAPICH2 1.4
  – MPI tests 3.2
• NFS-RDMA in beta
• Many bug fixes
1.5 – OpenSM new features
• Scalability & performance
  – Optimized SL2VL setup
  – Parallel LFTs setup
  – Parallel MFTs setup
• Routing & multicast
  – FTree improvements
  – Routing engine reloading
  – Mesh switch reordering optimizations
  – MGID to MLID compression
• QoS improvements
  – SL2VL setup optimization
  – QoS/LASH co-exist
• Major bug fixes
  – MCG join/leave fixes
  – Clean delayed MCG deletion
OFED 1.5.1
• To be released within the next week
• Main new features:
  – Added RoCEE support
  – Updated Open MPI to rev 1.4.2
  – Updated MVAPICH2 to rev 1.4-2
  – Updated DAPL to rev 2.0.26
  – Added extended atomic operations to ConnectX (kernel)
  – Improved NFS/RDMA stability
    • Needs more testing
  – Added IPv6 support to RDMACM
  – Bug fixes
OFED 2010 Plans
• Proposal:
  – Provide additional dot releases throughout the year
    • 1.5.2 – June
    • 1.5.3 – September
  – Delay 1.6, and the change of kernel base, until Q1 2011
• Pros:
  – Improve the stability of OFED 1.5.x, with faster bug fix turnaround
  – Integrate support for new distros as they become available
  – Logo program: vendors will be able to fix issues found
• Cons:
  – Many patches in the fixes directory
• Opinions? …
OFED 1.5.2 New Features
• Add new OSes: RHEL 5.5, SLES 11 SP1
• Update the management package
• Add-on packages that do not touch the core
• New low level drivers for new HW
• Critical bug fixes
• Schedule: June/July
  – Decide in EWG meetings
* OFED 1.5.3: will decide whether it is needed after 1.5.2
OFED 1.6 New Features
• Kernel base: 2.6.35 or 2.6.36
• Virtualization and SR-IOV support
• Mellanox Vnic/vHBA for BridgeX
• New HW from vendors (if any)
• New scalability features (if any)
• Soft-RoCEE & Soft-iWARP drivers
  – If ready, these can go into 1.5.x earlier
• New management features
  – 3D torus support by OpenSM
• MMU notifier for MPI
  – A solution that will be accepted by the kernel
• Improve connection management scheme?
• More features
  – Based on input from customers
OFED 1.6 Schedule & OS support
• OS support reduction
  – Stop supporting old OSes (e.g. RHEL 4.x, SLES 10)
• Schedule suggestion:
  – Start work during Q4 2010
  – Release in Q1 2011
• Opinions?
Reminder: How do you contribute?
• Develop new code and features
• Bug fixes
• Performance tuning
• Contribute backports for new OSes and forward ports for new kernels
• QA and testing
• Send patches and comments to the mailing lists:
  – [email protected] – OFED for Linux specific only
  – [email protected] – General Linux development
• Open bugs in Bugzilla (https://bugs.openfabrics.org/)
  – Choose OpenFabrics Linux when opening a new bug
  – Test old bugs with new releases and update them in Bugzilla
• Participate in EWG bi-weekly meetings
Open Discussion
• Should we have a scalability roadmap for OFED? How do we plan for scalability?
• Should we continue to include MPIs in the OFED releases?
• What’s the next step on MMU notifier kernel support for MPI?
• Questions and feedback from the community
Scalability Challenges
• Scale out to 10K-20K or more nodes
  – Performance
  – Reliability
  – Subnet management
• Focus additional attention/resources on issues
  – Should we have a BOF on this?
Infrastructure Scalability/Features
• Improve multicore affinity/awareness
• Multicast
• Flow control for SRQ
• Fault tolerance
• IB monitoring
  – Performance counters, throughput, hotspots, degraded links
• Adaptive routing
  – HCA out-of-order delivery
  – Switch logic for state info & adaptive algorithm, etc.
• Path Record support
• RDMA CM
MPI Distribution in OFED: Rationale
• Open source MPIs were initially included to “bootstrap” the OFED project
• MPI was the main user for OFED, so this seemed like a natural pairing
– Made it (significantly) easier for customers to get their MPI jobs running on InfiniBand
– However, distros don’t like it – they package their own
• QA testing of MPI + OFED is still extremely valuable
– This is not a discussion of removing MPI + OFED QA
• How many use the MPI from OFED vs. compiling their own?
MPI Distribution in OFED: Pros
• MPI is still the most common OFED “customer”
• HPC customers get network stack + MPI in one package
– Helps rapid MPI deployment on new clusters (out-of-box)
– The MPI-selector function lets users select their MPI stack of choice during installation
• Customers get QA assurance of specific MPI + OFED version tuples
• Helps to test multiple functionalities of the OFED stack and IB/iWARP fabric with a comprehensive suite of MPI-level benchmarks
MPI Distribution in OFED: Cons
• MPIs have their own QA cycles
– MPI+OFED QA testing is more for OFED, not MPI
• Bundling induces project scheduling difficulties between OFED and various MPI packages
• RedHat and SuSE both say “Don’t do this!”
– They both already include the open source MPIs
– Makes it more difficult for them to take OFED drops
• Many users will download the latest-n-greatest MPIs anyway – not the ones included in OFED
User Level MMU Notification
• User-level registration caches need to be notified when registered memory is freed
  – Problem for middleware that hides memory registration
  – For example: MPI
• Roland prototyped “ummunotify”
  – Jeff Sq. adapted Open MPI to use ummunotify
  – …but ummunotify was rejected by the kernel maintainers
  – The kernel maintainers want the same functionality built on performance counters
• MPIs still need this functionality
  – Roland not available; Jeff Sq. will do the MPI work
  – Who can do the kernel side?
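Illustration of the problem: the sketch below is a toy Python model of a user-level registration cache, not the ummunotify interface or any MPI's actual code. The class name `RegistrationCache`, its methods, and the integer "handles" are all hypothetical; the "register" step only stands in for a real (expensive) memory registration. It shows why the cache needs a notification when memory is freed: without `on_range_freed`, the cache would keep returning a handle for memory that is no longer valid.

```python
from dataclasses import dataclass, field


@dataclass
class RegistrationCache:
    """Toy user-level cache of registered memory regions (hypothetical)."""
    # Maps (start, length) -> opaque registration handle.
    regions: dict = field(default_factory=dict)
    next_handle: int = 1
    registrations: int = 0  # counts the simulated "expensive" registrations

    def lookup_or_register(self, start: int, length: int) -> int:
        key = (start, length)
        if key not in self.regions:
            # Cache miss: stands in for a real memory registration call.
            self.regions[key] = self.next_handle
            self.next_handle += 1
            self.registrations += 1
        return self.regions[key]

    def on_range_freed(self, start: int, length: int) -> int:
        """Notification hook: drop every cached region overlapping the freed range."""
        end = start + length
        stale = [k for k in self.regions if k[0] < end and start < k[0] + k[1]]
        for k in stale:
            del self.regions[k]
        return len(stale)


cache = RegistrationCache()
h1 = cache.lookup_or_register(0x1000, 4096)
h2 = cache.lookup_or_register(0x1000, 4096)   # cache hit: same handle reused
dropped = cache.on_range_freed(0x1800, 256)   # freed range overlaps the cached region
h3 = cache.lookup_or_register(0x1000, 4096)   # stale entry gone, so register again
```

In this model the notification is an explicit method call; the open question on the slide is which kernel mechanism would deliver that call to user space.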
Questions and feedback from the community
Thank You!