Myricom, Inc. 325 N. Santa Anita Ave. Arcadia CA 91006

626-821-5555 Fax: 626-821-5316 http://www.myri.com

Scalable Cluster Interconnect

Overview and Technology Roadmap

Charles L. Seitz, chuck@myri.com

Linux Superclusters Users Conference, Albuquerque, NM, 13 September 2000

What is Myrinet?

• A high-performance, cost-effective, packet communication and switching technology

– ANSI Standard (ANSI/VITA 26-1998)

– Packets follow the route specified by the source host (source routing).

– Processing power at the hosts and in the interfaces

– This architecture allows an elegant, streamlined switching technology

• A descendant of packet communication and routing in MPPs, but commodity and open

• Used principally for scalable clusters

Myrinet Products and Applications

Myricom supplies all that is required to make a high-performance cluster from a collection of computers.

Software Host Interface

PCI Interfaces

Link Cables SAN (to 3m) Serial (to 10m) Fiber (to 200m) Long-wave Fiber

Cut-Through Switches

In-Cabinet Clusters

Desktop Hosts

VME Single-Board-Computer Clusters

Any Network Topology

2+2 Gbits/s

Myrinet Technology “in the Large”: Sandia National Laboratory Cplant™

2,576 Compaq Alpha Personal Workstations, 400 EV-5 + 768 EV-6 + 1408 EV-6, but not all in one cluster.

Compaq Custom Systems was the integrator. The system was built in three phases, in the summers of 1998, 1999, and 2000.

Cplant originally used 16-port Myrinet switches in each 8-host cabinet. The latest increment uses a mesh variant of the M2LM-Clos64 “Network in a Box” products for switching.

(Photo adapted from http://www.cs.sandia.gov/cplant/)

Myrinet Technology “in the Small”: CSPI Quad-PowerPC VME Signal-Processing Board

This CSPI two-level-multicomputer product uses the Myricom LANai-5 chip to interface the PowerPCs to the message-passing network.

This single-width VME board includes a packet-switched Myrinet network interconnecting the 4 nodes on the board and 4 external ports with an 8-port Myrinet switch (a chip not visible in this photo).

Why Myrinet? The “selling points.”

• Low latency

– ~8.5µs today (UNIX user process to user process, fully protected, with end-to-end data integrity checking)

– The lower the latency, the wider the application span

• High data rate

– 2+2 Gb/s shipping now

– 1.28+1.28 Gb/s legacy

– Copper and fiber links

• Unlimited scalability

• Very low host-CPU utilization

– LogP overhead = ~1µs

• “Peg-the-needle” PCI implementations

• High Availability features

– Self-mapping, self-healing

– Link-continuity monitoring

• Data Integrity features

– Memory and bus parity

– Link CRC

– Packet payload CRC

• More cost-effective than Gigabit Ethernet or Fibre Channel

– Cost per node < $1,500 today

– Cost per node < $1,000 soon

• Software drivers for all major platforms

– Download them from the Web

– Open source

Myrinet = ANSI/VITA 26-1998

Myrinet is defined at the Data-Link level (level 2 of the ISO reference model for computer networks) by its packet format and flow control. Think of Myrinet as the simplest packet-switched network you can devise.

Packet format (fields in byte order):

Source route (used by the switches, which strip the bytes as they are used)

Type (allows multiple protocols on one Myrinet)

Payload (any length)

CRC

http://www.myri.com/open-specs/

There are multiple Physical-level implementations.
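The source-routing behavior described on this slide can be sketched in a few lines of Python. This is an illustrative model only: the one-byte-per-hop route encoding, the 2-byte type field, the XOR checksum standing in for the real CRC, and the names `build_packet`, `switch_forward`, and `host_receive` are all assumptions made for the sketch, not details taken from ANSI/VITA 26-1998.

```python
TYPE_LEN = 2  # assumed width of the type field (hypothetical, for illustration)


def checksum(data: bytes) -> int:
    """Stand-in for the real CRC: XOR of all bytes (an assumption of this sketch)."""
    value = 0
    for byte in data:
        value ^= byte
    return value


def build_packet(route, packet_type, payload: bytes) -> bytes:
    """Source host: prepend one route byte per switch hop, append the check byte."""
    body = bytes(route) + packet_type.to_bytes(TYPE_LEN, "big") + payload
    return body + bytes([checksum(body)])


def _verified_body(packet: bytes) -> bytes:
    """Return the packet without its trailing check byte, or raise if it is corrupt."""
    if len(packet) < 1 or checksum(packet[:-1]) != packet[-1]:
        raise ValueError("checksum mismatch")
    return packet[:-1]


def switch_forward(packet: bytes):
    """Switch: use the leading route byte as the output port, then strip it.

    The check byte is recomputed because the covered bytes have changed.
    """
    body = _verified_body(packet)
    out_port, rest = body[0], body[1:]
    return out_port, rest + bytes([checksum(rest)])


def host_receive(packet: bytes):
    """Destination host: all route bytes are gone, so the type field comes first."""
    body = _verified_body(packet)
    return int.from_bytes(body[:TYPE_LEN], "big"), body[TYPE_LEN:]


if __name__ == "__main__":
    pkt = build_packet([3, 7], 0x0004, b"hello")  # two switch hops
    port1, pkt = switch_forward(pkt)
    port2, pkt = switch_forward(pkt)
    print(port1, port2, host_receive(pkt))
```

Because the route is consumed hop by hop, a switch in this model needs no routing table: it reads one byte, selects a port, and forwards the remainder.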

Myrinet Switches -- “Just Technology”

16-port 2nd-generation Myrinet switch (M2LM-SW16) with 8 SAN ports and 8 LAN ports

• 20.48 Gb/s bisection data rate (!) from a single-chip 16x16 crossbar.

• Path-formation latency 100ns SAN-SAN, 200ns SAN-LAN, 300ns LAN-LAN.

• 32 Watts, 2U rack mount size, no fan.

• SNMP/Ethernet monitoring & control (out of band) + Myrinet heartbeat.

• $5K US-list. The “workhorse” 2nd-generation Myrinet switch.

2nd-Generation Myrinet “Network in a Box”

Clos network of 16 16-port switches, with 64 LAN host ports and 64 SAN inter-switch ports.

Full (maximal) bisection data rate between the 64 host ports = 32 links (41+41 Gb/s). Data rate between the host ports and the inter-switch ports = 64 links (82+82 Gb/s).

160 Watts, 12U rack mount size

SNMP/Ethernet monitoring and control, with the full set of Myrinet high-availability features.

$40K US-list.
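The data rates quoted for the 2nd-generation switches are consistent with simple link counting at 1.28 Gb/s per direction. The short sketch below reproduces them. Reading the 16-port switch's 20.48 Gb/s as both directions of the 8 links that cross its bisection is an inference from the numbers, not something the slides state; the function name `aggregate_rate` is hypothetical.

```python
LINK_GBPS = 1.28  # one direction of a 2nd-generation Myrinet link


def aggregate_rate(links, link_gbps=LINK_GBPS):
    """One-direction aggregate rate in Gb/s across `links` parallel links."""
    return links * link_gbps


if __name__ == "__main__":
    # 64 host ports split in half: 32 links cross the bisection.
    print(round(aggregate_rate(32)), "+", round(aggregate_rate(32)), "Gb/s")
    # All 64 host ports against the 64 inter-switch ports.
    print(round(aggregate_rate(64)), "+", round(aggregate_rate(64)), "Gb/s")
    # Single 16-port crossbar: 8 links cross the bisection, both directions summed.
    print(2 * aggregate_rate(8), "Gb/s")
```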

Myrinet Interfaces

[Block diagram of the interface. Labels: network interface, fast SRAM, RISC, DMA controller & bus bridge, packet DMA, SAN port. Annotations: “Parts of the LANai chip”, “PCIDMA chip”.]

M3M-PCI64B-2: Universal 64/32-bit, 66/33MHz Myrinet-2000-SAN/PCI Interface

From a customer: “What makes Myrinet effective for clusters is the autonomy of the interfaces, which lets us get the OS out of the way.”

Myrinet Software Interfaces

Applications

MPI, Middleware

TCP, UDP

IP

Ethernet, Myrinet

Myrinet Control Program (MCP) (executes in the Myrinet interface)

Host OS

OS-bypass APIs (multiple host processes)

10/100/1000 Mb/s (Ethernet), 1280+1280 Mb/s (Myrinet)

VIA

The GM Message-Passing System

No Compromises

• Concurrent, protected, user-level access

• Reliable, ordered message delivery

• Very low CPU overhead

• Robust under network faults

• Mapping

• Segmentation and reassembly of long messages

• High-level flow control

• “Clean” API, with exception handling

• Zero-copy layering of other APIs

GM Data-Rate Performance (Myrinet-2000 SAN Interfaces)

GM short-message latency (Myrinet-2000 interfaces): ~8.5µs (best numbers)

GM CPU overhead = 1-2µs per message (LogP)

UNIX user process to user process

Fully protected

End-to-end data integrity
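To see how the short-message latency and the link data rate interact, a common first-order model treats the time to move a message as a fixed latency plus the size divided by the data rate. The sketch below applies that model; it is not GM's measured behavior. The 8.5µs latency is from the slide, while the 250 MB/s rate (the 2 Gb/s link rate in one direction, taken as an upper bound) and the function names are assumptions.

```python
def transfer_time(nbytes, latency_s, rate_bytes_per_s):
    """First-order model: fixed latency plus serialization time."""
    return latency_s + nbytes / rate_bytes_per_s


def effective_rate(nbytes, latency_s, rate_bytes_per_s):
    """Delivered bytes per second for a message of `nbytes`."""
    return nbytes / transfer_time(nbytes, latency_s, rate_bytes_per_s)


def half_power_size(latency_s, rate_bytes_per_s):
    """Message size at which the effective rate reaches half the peak rate."""
    return latency_s * rate_bytes_per_s


if __name__ == "__main__":
    LATENCY = 8.5e-6  # seconds, short-message latency quoted on the slide
    RATE = 250e6      # bytes/s, assumed upper bound (2 Gb/s in one direction)
    for size in (64, 1024, 65536):
        print(size, "bytes:", round(effective_rate(size, LATENCY, RATE) / 1e6, 1), "MB/s")
    print("half-power size:", round(half_power_size(LATENCY, RATE)), "bytes")
```

Under these assumptions, messages much shorter than about 2 KB are dominated by latency, which is one way to read the earlier claim that lower latency widens the application span.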

GM and MPICH-over-GM Latencies

UNIX user process to user process

Fully protected

End-to-end data integrity

MPICH over GM Data Rate

UNIX user process to user process

Fully protected

End-to-end data integrity

Myrinet 2000 – Third-Generation Myrinet

This evolutionary step improves the links at the Physical level (both the performance and the “look and feel” of Myrinet) and introduces interfaces with 1.7x and 2.5x faster RISCs, but Myrinet-2000 is compatible with 2nd-generation Myrinet at the Data Link level and in the software. (Don’t try to innovate along too many dimensions at once! This is a technology push, not an architecture change.)

SAN-1280 → SAN-2000: circuit boards & ribbon cables (3m)

LAN → Serial: copper HSSDC, 2+2 Gb/s to 10m

Low-cost fiber: multimode fiber, 2+2 Gb/s to 200m

Myrinet-2000

• 2+2 Gb/s links using the same Physical media and signaling as {2.5GbE, 2xFC, & 1xInfiniBand}.

– HSSDC cables to 10m and low-cost fiber to 200m.

• 64/32-bit, 66/33MHz, Myrinet/PCI interfaces (LANai 9)

– 132 MHz RISC, 1,056 MB/s local-memory data rate (achieves 8.5µs GM latency)

– In 1Q01, 200MHz RISC, 1,600 MB/s local-memory data rate (~6.5µs GM latency)

• Modular Switches

– 16-port crossbar and 32|64|128-host Clos switches, with line-card options for SAN, serial, or fiber links.
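The local-memory data rates quoted above both work out to 8 bytes per RISC clock cycle. The sketch below checks that arithmetic and, for comparison, computes the nominal peak rates of the PCI variants named on the slide. Interpreting the 8 bytes per cycle as a 64-bit memory path is an assumption, as is the helper name `peak_mb_per_s`; the PCI figures use the nominal 33 and 66 MHz clocks.

```python
def peak_mb_per_s(clock_mhz, width_bits):
    """Peak rate in MB/s for one transfer of `width_bits` per clock cycle."""
    return clock_mhz * width_bits / 8


if __name__ == "__main__":
    # Local memory, assuming a 64-bit path at the RISC clock (an inference).
    print("132 MHz RISC:", peak_mb_per_s(132, 64), "MB/s")
    print("200 MHz RISC:", peak_mb_per_s(200, 64), "MB/s")
    # Nominal PCI peaks for the bus variants the interface supports.
    print("32-bit/33 MHz PCI:", peak_mb_per_s(33, 32), "MB/s")
    print("64-bit/66 MHz PCI:", peak_mb_per_s(66, 64), "MB/s")
```

On these numbers the 132 MHz part's local memory has twice the nominal peak of 64-bit/66 MHz PCI, which leaves headroom for the network side to move data concurrently with host DMA.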

Myrinet-2000 128-Host “Network in a Box”

This family of products supports hot-plugging of line cards, fans, and dual redundant power supplies. Microcomputer monitoring (SNMP over Ethernet) provides extensive diagnostic capabilities and the management features needed for high-availability applications.

Different types of line cards have Serial, Fiber, SAN, or legacy LAN ports

[Diagram: the spine of the Clos network (backplane) forms the Clos spreader network; 16 line cards, each serving 8 hosts, provide ports to up to 128 hosts.]

The family of Myrinet-2000 switch products

17-slot enclosure, up to 128 hosts: Clos “spreader” network (128 links); 8 16-port switches on the backplane; up to 16 16-port switches on the line cards

9-slot enclosure, up to 64 hosts: Clos “spreader” network (64 links); 4 16-port switches on the backplane; up to 8 16-port switches on the line cards

5-slot enclosure, up to 32 hosts: … (32 links); 2 16-port switches on the backplane; up to 4 16-port switches on the line cards

3-slot enclosure, up to 16 hosts: one line card with a 16-port switch, and one straight-through line card

Add the optional monitoring line card to provide SNMP/Ethernet monitoring and control. The monitoring line card includes a microcontroller and dual Ethernet ports. All line cards are interchangeable across the product family.
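The switch and link counts for the 32-, 64-, and 128-host enclosures follow from one rule: each line-card switch uses 8 of its 16 ports for hosts and the other 8 as links into the spreader network, and the backplane switches supply exactly enough ports to terminate those links. The sketch below is a minimal check of that rule; the function name `clos_enclosure` and the returned field names are hypothetical, and the 16-host enclosure is excluded because it uses a single switch plus a straight-through card.

```python
SWITCH_PORTS = 16
HOSTS_PER_LINE_CARD = 8


def clos_enclosure(hosts):
    """Switch and link counts for a two-level Clos of 16-port switches."""
    if hosts % HOSTS_PER_LINE_CARD != 0:
        raise ValueError("hosts must be a multiple of 8")
    line_card_switches = hosts // HOSTS_PER_LINE_CARD
    # Ports not used for hosts go up to the backplane.
    uplinks_per_card = SWITCH_PORTS - HOSTS_PER_LINE_CARD
    spreader_links = line_card_switches * uplinks_per_card
    if spreader_links % SWITCH_PORTS != 0:
        raise ValueError("links do not fill whole backplane switches")
    return {
        "line_card_switches": line_card_switches,
        "spreader_links": spreader_links,
        "backplane_switches": spreader_links // SWITCH_PORTS,
    }


if __name__ == "__main__":
    for hosts in (32, 64, 128):
        print(hosts, clos_enclosure(hosts))
```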

Why Clos Networks?

• Maximal performance under arbitrary traffic patterns

– Minimum bisection is the largest possible

– “Rearrangeable Network” (can route any permutation)

– Network looks the same from any host (simplifies cluster management)

• Multiple paths

– All progressive routes are deadlock-free

– Use multiple paths for redundancy

– Use multiple paths to avoid hot spots (random dispersion)

• Scales well. For n hosts (minimum bisection = n/2):

– Diameter varies as log(n)

– Cost varies as n log(n)

• Modular

– Economies of sharing the power supply and microcontroller between many switches, and implementing many of the inter-switch links on circuit boards rather than cables.
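The multiple-path and random-dispersion points can be illustrated with a small model of the 128-host configuration: 16 line-card switches with 8 hosts each, and 8 backplane switches. Every pair of hosts on different line cards then has one up-then-down route per backplane switch, and the source can pick among them at random. The port numbering (line-card ports 0 to 7 for hosts and 8 to 15 for backplane links, backplane port i leading to line card i) and the function names are assumptions of this sketch, not the product's actual wiring.

```python
import random

HOSTS_PER_LINE_CARD = 8
BACKPLANE_SWITCHES = 8


def routes(src, dst):
    """All shortest source routes from host src to host dst.

    Each route is a list of output ports, one per switch traversed.
    """
    src_card = src // HOSTS_PER_LINE_CARD
    dst_card = dst // HOSTS_PER_LINE_CARD
    dst_port = dst % HOSTS_PER_LINE_CARD
    if src_card == dst_card:
        return [[dst_port]]  # one hop through the shared line-card switch
    # Up to any backplane switch, down to the destination card, out to the host.
    return [
        [HOSTS_PER_LINE_CARD + spine, dst_card, dst_port]
        for spine in range(BACKPLANE_SWITCHES)
    ]


def pick_route(src, dst, rng=random):
    """Random dispersion: choose uniformly among the equivalent routes."""
    return rng.choice(routes(src, dst))


if __name__ == "__main__":
    print(routes(0, 5))
    print(len(routes(0, 127)), "routes from host 0 to host 127")
    print(pick_route(0, 127))
```

Routes of this up-then-down shape are the "progressive" routes the slide refers to, and the same list of alternatives serves for redundancy when a backplane switch or link fails.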

Myrinet Technology – History & Roadmap

[Timeline chart, 1994 to 2003, divided into “Past” and “Future”.]

Link generations:

• 1st Generation: 0.64+0.64 Gb/s links

• 2nd Generation: 1.28+1.28 Gb/s links

• 3rd Generation (“Myrinet 2000”): 2+2 Gb/s links

Products & Features (in slide order):

• 32-bit SBus (SPARC) interfaces, 8-port switches

• 32-bit PCI interfaces (LANai 4), 8-port switches

• SAN PHY level

• Clos “network in a box” of 8-port switches

• 16-port switches, HA features

• 64-bit PCI interfaces (LANai 7), GM message system

• Clos “network in a box” of 16-port switches

• 64-bit PCI interfaces (LANai 9), SW16, Clos128

• PCI-X, multiple virtual channels

• Gigabit Ethernet ports on Myrinet switches

• Full interoperability with 1x InfiniBand

• 4x InfiniBand links

Myrinet Technology Roadmap

• In mid-2001, PCI-X interfaces

– PCI-X is not only 2x faster than 66MHz PCI; PCI-X also allows concurrent, interleaved transactions.

• Also in mid-2001, multiple virtual channels.

– Allows “express lanes” for latency-sensitive traffic.

– Coordinated with PCI-X, because today’s PCI would otherwise get in the way of latency-sensitive transactions.

– Required later for full interoperability with InfiniBand.

• Programmable bridges/routers between {Myrinet, Gigabit Ethernet, InfiniBand} with “Myrinet inside.”

• Support or converge with InfiniBand.

– We have all of the necessary technology now for the PHY layer.

– Track and support the protocols and APIs in firmware.
