Connection Machine Architecture
Greg Faust, Mike Gibson, Sal Valente
CS-6354 Computer Architecture
Fall 2009
Historic Timeline
• 1981: MIT AI Lab Technical Memo on CM
• 1983: Thinking Machines Inc. Founded
• 1985: Danny Hillis wins ACM Doctoral Dissertation Award
• 1986: CM-1 Ships
• 1987: CM-2 Ships
• 1991: CM-5 Announced
• 1991: CM-5 Ships
• 1994: TMI files for Chapter 11 – Sun/Oracle pick the bones
• Heavily DARPA funded/backed
– $16M+ in direct contracts, plus subsidized CM sales
Involved Notables
• Danny Hillis – CM inventor and TMI Founder
• Charles Leiserson – Fat tree inventor
• Richard Feynman – Nobel Prize-winning Physicist
• Marvin Minsky – MIT AI Lab “Visionary”
• Guy Steele – Common Lisp, Grace Hopper Award
• Stephen Wolfram – Mathematica inventor
• Doug Lenat – AI researcher (Cyc)
• Greg Papadopoulos – MIT Media Lab, Sun CTO
• Various others
CM-1 and CM-2 Architecture
• Original design goal: support neuron-like simulations
• Up to 64K single-bit processors (actually 3 bits in and 2 out)
• 16 processors/chip, 32 chips/PCB, 16 PCBs/cube, 8 cubes/hypercube
• Hypercube architecture – each 16-proc chip is a hyper-node
• Each proc has 4K bits of bit-addressable RAM
– Distributed Physical Memory
– Global Memory Addresses
• Up to 4 front-end computers talk to sequencers via a 4x4 crossbar
• “Sequencers” issue SIMD instructions over a Broadcast Network
• Bit procs communicate via 2D local HW grid connections (“NEWS”)
• Bit procs communicate via the hypercube network using MSG passing
• Lots of Twinkling Lights!!
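The packaging numbers and the hypercube addressing above can be checked with a few lines of Python. This is only a sketch: the constants come from the slide, but the function names `neighbors` and `route` are made up here, and the dimension-order routing shown is a textbook scheme used for illustration, not a description of the actual CM router.

```python
# Packaging hierarchy from the slide: processors -> chips -> boards -> cubes.
PROCS_PER_CHIP = 16
CHIPS_PER_PCB = 32
PCBS_PER_CUBE = 16
CUBES_PER_MACHINE = 8

total_chips = CHIPS_PER_PCB * PCBS_PER_CUBE * CUBES_PER_MACHINE
total_procs = PROCS_PER_CHIP * total_chips

# One hypercube node per chip, so the dimension is log2(number of chips).
dimension = total_chips.bit_length() - 1


def neighbors(node, dim=dimension):
    """Nodes whose address differs from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(dim)]


def route(src, dst, dim=dimension):
    """Hypothetical dimension-order path: fix differing bits, low to high."""
    path = [src]
    cur = src
    for d in range(dim):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d          # cross the link along dimension d
            path.append(cur)
    return path


print(total_procs, total_chips, dimension)
print(route(0b0101, 0b0011))
```

The number of hops on such a path equals the number of address bits in which source and destination differ, so no two chips are more than `dimension` hops apart.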
CM-1 and CM-2 Programming
• ISA supports:
– Bit-oriented operations
– Arbitrary-precision multi-bit scalar ops, using a bit-serial implementation on the bit procs
– Full multi-dimensional vector ops
• “Virtual Processor” idea is similar to CUDA threads, but virtual processors are statically allocated
• OS and programming tools run on the front-ends
• *Lisp was the initial programming language
• C* and CM-Fortran came later
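To make the "bit-serial implementation" bullet concrete, here is a small Python simulation of multi-bit addition on an array of 1-bit processors. It is a sketch under stated assumptions: the memory layout, the helper names (`store`, `load`, `bit_serial_add`), and the per-processor carry flag are illustrative and not taken from the CM instruction set.

```python
def store(bits, addr, value, width):
    """Write `value` into a processor's bit memory, least significant bit first."""
    for i in range(width):
        bits[addr + i] = (value >> i) & 1


def load(bits, addr, width):
    """Read a `width`-bit field back out of a processor's bit memory."""
    return sum(bits[addr + i] << i for i in range(width))


def bit_serial_add(mem, a_addr, b_addr, out_addr, width):
    """Add two fields in every processor's memory, one bit position per step.

    `mem` holds one bit array per processor. The outer loop plays the role
    of the sequencer: each iteration is one broadcast step that all
    processors execute in lockstep on their own data.
    """
    carry = [0] * len(mem)                   # one carry flag per processor
    for i in range(width):                   # one broadcast step per bit
        for p, bits in enumerate(mem):       # conceptually simultaneous
            a, b, c = bits[a_addr + i], bits[b_addr + i], carry[p]
            bits[out_addr + i] = a ^ b ^ c               # sum bit
            carry[p] = (a & b) | (a & c) | (b & c)       # carry out
    return carry                             # final carry (overflow) flags


# Four simulated processors, each with 48 bits of memory and 16-bit fields.
operands = [(3, 5), (255, 1), (1000, 24), (0, 0)]
mem = [[0] * 48 for _ in operands]
for bits, (a, b) in zip(mem, operands):
    store(bits, 0, a, 16)
    store(bits, 16, b, 16)
bit_serial_add(mem, 0, 16, 32, 16)
print([load(bits, 32, 16) for bits in mem])
```

Each step takes three input bits and produces two output bits, and the cost of an add grows linearly with the operand width, which is the trade-off behind arbitrary-precision arithmetic on 1-bit hardware.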
CM-2 Improvements
• 1 Weitek IEEE FP coprocessor per 32 1-bit procs
• Up to 256K bits of memory per processor
• Added ECC to Memory
• Implemented the I/O subsystem
– Up to 80 GByte RAID array called “Data Vault”, using 39 striped disks with ECC, plus spare disks on standby
– High Speed Graphics Output
• En-route MSG combining in H-Cube router
• New implementation of multi-dimensional NEWS on top of the H-Cube (special addressing mode)
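One standard way to lay a multi-dimensional grid on top of a hypercube is a Gray-code embedding, in which grid neighbors always land on hypercube neighbors. The slide does not say which addressing scheme the CM-2 used, so the Python sketch below is an assumption-labeled illustration of the general technique; `gray`, `grid_to_cube`, and `hamming` are hypothetical helper names.

```python
def gray(i):
    """Reflected binary Gray code: consecutive integers differ in one bit."""
    return i ^ (i >> 1)


def grid_to_cube(row, col, col_bits):
    """Map a 2D grid coordinate to a hypercube node address.

    Assumes both grid dimensions are powers of two; the row's Gray code
    occupies the high address bits and the column's the low `col_bits`.
    """
    return (gray(row) << col_bits) | gray(col)


def hamming(a, b):
    """Number of address bits in which two hypercube nodes differ."""
    return bin(a ^ b).count("1")


# An 8 x 16 grid needs 3 + 4 = 7 hypercube dimensions.
ROWS, COLS, COL_BITS = 8, 16, 4
north_south = [
    hamming(grid_to_cube(r, c, COL_BITS), grid_to_cube((r + 1) % ROWS, c, COL_BITS))
    for r in range(ROWS) for c in range(COLS)
]
east_west = [
    hamming(grid_to_cube(r, c, COL_BITS), grid_to_cube(r, (c + 1) % COLS, COL_BITS))
    for r in range(ROWS) for c in range(COLS)
]
print(set(north_south), set(east_west))
```

Under this mapping every North, East, West, or South step, including wraparound at the grid edges, crosses exactly one hypercube link.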
CM-5 vs CM-1 and CM-2
• Significant departure from CM-1 and CM-2
• Targeted at more scientific and business applications
• More Commercial Off-The-Shelf components (“COTS”)
• Large Array of SPARC Processing Nodes
– 1-bit processors are abandoned
• Abandoned “NEWS” Grid and Hyper-Cube Networks
• Delivered a 1024-node machine, with claims that 16K nodes were possible
• Even More Twinkling Lights!
CM-5 Overall Architecture
• “Coordinated Homogeneous Array of RISC Microprocessors” or “CHARM”
• Asymmetric CoProcessors Model
– Large Array of Processor Nodes
– Small Collection of Control Nodes
• 2 Separate scalable networks
– One for data
– One for control and synchronization
• Still uses striped RAID for high disk bandwidth
Division of Labor
• Processor Nodes can be assigned to a “Partition”
• One Control Node per Partition
• Control Node runs scalar code, then broadcasts parallel work to Processor Nodes
• Processor Nodes receive a program, not an instruction stream, have own Program Counter
• Processor Nodes can access other nodes' memory by reading or writing a global memory address
• Processor Nodes also communicate via MSG passing
• Processor Nodes cannot issue system calls
Control Nodes
• Full Sun Workstations
• Running UNIX
• Connected to the “Outside World”
• Handles Partition Time Sharing
• Connected to both data and control networks
• Performs System Diagnostics
Processor Nodes
• Each node is a 5-chip microprocessor
– Off-the-shelf SPARC processor @ 40 MHz
– 32 MBytes local node memory
– Multi-port memory controller for added BW
– “Caching techniques do not perform as well on large parallel machines”
– Proprietary 4-FPU vector coprocessor
– Proprietary network controller
Data Network Architecture
• Point-to-point inter-node communication and I/O
• Implemented as a Fat Tree
– Fat Trees invented by TMI employee Charles Leiserson
• Claim: bandwidth is expandable on site
• Delivers 5 GB/sec bisection BW on a 1024-node machine
• Data router chip is an 8x8 crossbar switch
• Faulty nodes are mapped out of the network
– Programs cannot assume a network topology
• Network can be flushed when Time Share swaps occur
• Network, not processors, guarantees end-to-end delivery
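The basic routing idea in a tree-shaped network is that a message climbs only as far as the lowest switch that covers both endpoints, then descends. The Python sketch below illustrates that idea under loud assumptions: it treats the network as a tree with 4 children per switch and leaves numbered consecutively, which the slide does not state, and `lca_height` and `down_path` are hypothetical helper names.

```python
ARITY_BITS = 2                      # assumption: 4 children per switch
ARITY_MASK = (1 << ARITY_BITS) - 1


def lca_height(src, dst):
    """Levels a message must climb before src and dst share a subtree."""
    height = 0
    while src != dst:
        src >>= ARITY_BITS          # drop one base-4 digit per level
        dst >>= ARITY_BITS
        height += 1
    return height


def down_path(dst, height):
    """Child port taken at each level on the way down, topmost level first."""
    return [(dst >> (ARITY_BITS * level)) & ARITY_MASK
            for level in range(height - 1, -1, -1)]


def hops(src, dst):
    """Switch-to-switch hops: climb `height` levels, then descend the same."""
    return 2 * lca_height(src, dst)


print(lca_height(5, 6), lca_height(0, 1023), down_path(27, 3))
```

In a fat tree the links get wider toward the root, so the upward half of the route can pick any available parent link and only the downward half is fixed by the destination address. That freedom is one reason a program should not depend on a particular physical path.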
Separate Control Network
• Synchronization & control network
• Complete Binary Tree organization
• Provides broadcast capability
• Implements barrier operations
• Implements interrupts for timesharing
• Performs reduction operators (Sum, Max, AND, OR, Count, etc)
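The reduction operators listed above map naturally onto a binary tree: each internal node combines the values of its two children and passes the result upward. The Python sketch below simulates that level by level; `tree_reduce` is a hypothetical name, and treating a barrier as an AND-reduction of "arrived" flags is one common way to think about it, not a claim about the CM-5 hardware.

```python
import operator


def tree_reduce(values, op):
    """Combine leaf values pairwise up a complete binary tree.

    Returns the value at the root and the number of tree levels traversed.
    Assumes the number of leaves is a power of two.
    """
    level = list(values)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("number of leaves must be a power of two")
    steps = 0
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


leaves = [3, 1, 4, 1, 5, 9, 2, 6]
print(tree_reduce(leaves, operator.add))
print(tree_reduce(leaves, max))

# A barrier viewed as an AND-reduction: it completes only when all have arrived.
arrived = [True] * 8
print(tree_reduce(arrived, operator.and_))
```

The step count grows with the logarithm of the number of nodes, which is why a dedicated tree network can make global sums and barriers cheap compared with sending point-to-point messages.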
CM-5 Programming
• Supports multiple Parallel High Level Languages and Programming Styles
– Including Data Parallel Model from CM-1 and CM-2
• Goal: Hide many decisions from programmers
– CM-1, CM-2 vs CM-5 ISA changes
– Use of Processor Node CPU vs Vector CoProcessors
– Partition-wide synchronizations generated by the compiler
• Is it MIMD, SPMD, SIMD?
– “Globally Synchronized MIMD”
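A rough way to picture "globally synchronized MIMD" is a set of nodes that each run their own copy of a program, take different branches, and meet at partition-wide synchronization points. The Python sketch below uses threads and `threading.Barrier` as stand-ins for Processor Nodes and the control network; the node count, data, and `node_program` function are all illustrative assumptions.

```python
import threading

N_NODES = 4


def run_partition():
    """Run one small program on N_NODES simulated nodes and return their results."""
    data = list(range(16))
    partial = [None] * N_NODES
    result = [None] * N_NODES
    barrier = threading.Barrier(N_NODES)     # stand-in for a partition-wide sync

    def node_program(rank):
        chunk = data[rank::N_NODES]          # this node's share of the data
        # Local phase: each node follows its own control flow (MIMD).
        if rank % 2 == 0:
            partial[rank] = sum(chunk)
        else:
            partial[rank] = max(chunk)
        # Global phase boundary: nobody proceeds until everyone has written.
        barrier.wait()
        # Safe to read a neighbor's value only after the barrier.
        result[rank] = partial[rank] + partial[(rank + 1) % N_NODES]

    threads = [threading.Thread(target=node_program, args=(r,)) for r in range(N_NODES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial, result


print(run_partition())
```

Without the barrier, a fast node could read a neighbor's slot before it was written; with it, the nodes behave like a data-parallel program at the synchronization points while remaining independent in between.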
Sample CM Apps
• Machine Learning
– Neural Nets, concept clustering, genetic algorithms
• VLSI Design
• Geophysics (Oil Exploration), Plate Tectonics
• Particle Simulation
• Fluid Flow Simulation
• Computer Vision
• Computer Graphics, Animation
• Protein Sequence Matching
• Global Climate Model Simulation
References
• Danny Hillis PhD: The Connection Machine
• Inc: The Rise and Fall of Thinking Machines
• Wiki: Connection Machine
• ACM: The CM-5 Connection Machine
• ACM: The Network Architecture of the CM-5
• IEEE: Architecture and Applications of the Connection Machine
• IEEE: Fat-trees: universal networks for hardware-efficient supercomputing
• Encyclopedia of Computer Science and Technology