The CRAY-1 Computer System Richard M. Russell Presented by Andrew Waterman ECE259 Spring 2008
Feb 09, 2016
Background
• CRAY-1 by no means first vector machine
– 1960s: Westinghouse Solomon/ILLIAC IV
– 1974: CDC STAR 100
• “I never, ever want to be a pioneer” -- Cray
– STAR 100, ILLIAC IV: who's this Amdahl dude?
• 1972: Cray Research formed after spat with CDC
– Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much
• 1976: first CRAY-1 deployed at Livermore
CRAY-1 Hardware
Look Ma, No ASICs!
CRAY-1 Architecture
• 5-ton vector uniprocessor
• Word size = 64 bits
• 80 MHz clock
• 8 MB RAM in 16 banks @ 20 MHz
– f_cpu/f_mem = 4 (!!)
• Fairly RISCy 16- or 32-bit instructions
– Load/store; register-register operations
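A minimal sketch of why the 16 banks matter when each bank is 4x slower than the CPU clock. It assumes low-order interleaving (bank = word address mod 16) and at most one request issued per cycle; both are assumptions of this sketch, and `cycles_for_stride` is a hypothetical helper, not anything from the paper.

```python
NUM_BANKS = 16        # from the slide
BANK_BUSY_CYCLES = 4  # 80 MHz CPU clock / 20 MHz bank rate


def cycles_for_stride(n_words: int, stride: int) -> int:
    """Cycles needed to issue n_words accesses at a fixed word stride.

    Assumption: bank = address mod NUM_BANKS (low-order interleaving),
    and a request stalls until its bank is no longer busy.
    """
    bank_free_at = [0] * NUM_BANKS
    cycle = 0
    for i in range(n_words):
        bank = (i * stride) % NUM_BANKS
        cycle = max(cycle, bank_free_at[bank])      # stall on a busy bank
        bank_free_at[bank] = cycle + BANK_BUSY_CYCLES
        cycle += 1                                  # one issue slot per cycle
    return cycle


if __name__ == "__main__":
    for stride in (1, 4, 8, 16):
        print(stride, cycles_for_stride(64, stride))
```

Under these assumptions, unit stride keeps all banks busy in rotation and sustains one word per cycle, while a stride of 16 hits the same bank every time and drops to one word per 4 cycles.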
Scalar Operation and Octal Annoyance
• 10₈ (8) A-registers for 24-bit address calculations
• 100₈ (64) B-registers serve as backing store for A-registers
• 10₈ (8) S-registers for source/dest of scalar integer/FP insns
• T is to S as B is to A
• 11₈ (9) pipelined scalar FUs
– Address add, mult
– Integer add, shift, logic, pop count
– FP add, mult, reciprocal
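For anyone else annoyed by the octal: a tiny sketch that reads the counts on this slide as base-8 numerals (10, 100, 10, 11) and converts them to decimal. The dictionary keys are just labels chosen for this example.

```python
# Counts as written on the slide, interpreted as octal numerals.
octal_counts = {
    "A-registers": "10",
    "B-registers": "100",
    "S-registers": "10",
    "scalar FUs": "11",
}

# int(text, 8) parses a string as a base-8 number.
decimal_counts = {name: int(text, 8) for name, text in octal_counts.items()}

if __name__ == "__main__":
    for name, count in decimal_counts.items():
        print(f"{name}: {count}")
```

The 9 scalar FUs match the list above: 2 address, 4 integer, 3 FP.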
Scalar Operation
• Protection without virtual memory
– Base & limit address regs
    • Ld $dest,$addr actually loads from $base+$addr
    • Program killed if $base+$addr >= $limit
• A handful of registers for interrupts, exceptions, etc.
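The base & limit check above can be sketched in a few lines. This follows the rule exactly as stated on the slide (fault when base + addr >= limit); `ProtectionFault`, `translate`, and `load` are hypothetical names for this illustration, and memory is modeled as a plain Python list.

```python
class ProtectionFault(Exception):
    """Raised when a program touches memory outside its partition."""


def translate(addr: int, base: int, limit: int) -> int:
    """Relocate a program-relative address and enforce the limit."""
    physical = base + addr
    if physical >= limit:
        # On the real machine the program is killed; here we raise.
        raise ProtectionFault(f"address {physical} >= limit {limit}")
    return physical


def load(memory: list, addr: int, base: int, limit: int):
    """Ld $dest,$addr: actually reads memory[$base + $addr]."""
    return memory[translate(addr, base, limit)]


if __name__ == "__main__":
    memory = list(range(100))
    print(load(memory, 5, base=10, limit=20))  # reads physical word 15
```

Every program sees addresses starting at 0, so jobs can be relocated without page tables.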
OS and Front End
• COS (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
– Packaged with CAL (assembler)
– ...and CFT (Fortran compiler), more later
• Command-line interface and job submission via separate front-end computer, e.g. VAX
Vector Operation (Finally!)
• 8 × 64-word V-registers (8 registers of 64 words each)
• Vector Length Register
– Indicates # ops performed by vector insns
– Set from contents of an A-register
• Vector Mask Register
– Indicates which elements in vector to operate on
– Set by vector test insns (e.g. VM[i] := ($Vk[i] == 0))
• 6 Vector FUs
– Integer add, shift, bitwise logic
– FP via scalar FPU: add, mult, reciprocal
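A rough model of how the Vector Length and Vector Mask registers shape a vector instruction, following the descriptions above. Registers are Python lists; `vector_add`, `vector_test_zero`, and `masked_add` are hypothetical names, and the choice that masked-off elements keep the first operand's value is an assumption of this sketch, not the documented CRAY-1 behavior.

```python
VECTOR_REGISTER_WORDS = 64  # words per V-register, from the slide


def vector_add(vj: list, vk: list, vl: int) -> list:
    """Add only the first vl elements, as the Vector Length register dictates."""
    assert 0 <= vl <= VECTOR_REGISTER_WORDS
    return [vj[i] + vk[i] for i in range(vl)]


def vector_test_zero(vk: list, vl: int) -> list:
    """Vector test: VM[i] := (Vk[i] == 0)."""
    return [vk[i] == 0 for i in range(vl)]


def masked_add(vj: list, vk: list, vm: list, vl: int) -> list:
    """Add where the mask is set; elsewhere keep vj[i] (assumed behavior)."""
    return [vj[i] + vk[i] if vm[i] else vj[i] for i in range(vl)]


if __name__ == "__main__":
    vm = vector_test_zero([0, 3, 0, 5], vl=4)
    print(vm)
    print(masked_add([1, 2, 3, 4], [10, 10, 10, 10], vm, vl=4))
```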
Vector Load/Store Architecture
• Big departure from STAR 100: register-register ops
• CRAY-1 memory bandwidth == 80 Mword/s == 1 word/cycle
– If all 2-source insns are memory-memory, then IPC = 1/3 (two loads plus one store per result)! And that assumes no bank conflicts!
– Solution: the RISC approach
• Combined with chaining (next), can sustain >> 1 flop/cycle
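The 1/3 figure is just bandwidth arithmetic, sketched here. It assumes each memory-memory result costs two source reads and one destination write against a 1 word/cycle memory; `results_per_cycle` is a hypothetical helper.

```python
from fractions import Fraction

MEMORY_WORDS_PER_CYCLE = 1  # 80 Mword/s at 80 MHz, from the slide


def results_per_cycle(memory_words_per_result: int) -> Fraction:
    """Upper bound on results/cycle when memory is the only bottleneck."""
    return Fraction(MEMORY_WORDS_PER_CYCLE, memory_words_per_result)


if __name__ == "__main__":
    # Memory-memory op: read 2 sources + write 1 destination = 3 words.
    print(results_per_cycle(3))
```

With operands held in V-registers, the arithmetic itself generates no memory traffic, so this bound stops applying to it.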
Chaining
• Pipeline bypass meets vectors
• Consider SAXPY vector expression a*X+Y
– Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds)
    • Total latency: 128 + mult latency + add latency, since, in CRAY-1, all FUs are pipelined
– But... no fundamental serialization requirement
    • As soon as a*X[0] is computed, can compute a*X[0]+Y[0]
    • Total latency: 64 + mult latency + add latency (speedup of almost 2)
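Plugging numbers into the two latency formulas above. The pipeline depths used here (7 cycles for FP multiply, 6 for FP add) are illustrative values assumed for this sketch, not figures from these slides.

```python
def unchained_latency(n: int, mul_latency: int, add_latency: int) -> int:
    """All n multiplies, then all n adds: 2n + mult latency + add latency."""
    return 2 * n + mul_latency + add_latency


def chained_latency(n: int, mul_latency: int, add_latency: int) -> int:
    """Adds start as multiply results emerge: n + mult latency + add latency."""
    return n + mul_latency + add_latency


if __name__ == "__main__":
    n, mul, add = 64, 7, 6  # mul/add depths are assumed, see above
    slow = unchained_latency(n, mul, add)
    fast = chained_latency(n, mul, add)
    print(slow, fast, round(slow / fast, 2))
```

The speedup approaches 2 as the vector length grows relative to the pipeline depths.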
Chaining Example
• Assume: 8-element vectors, single-cycle ops
    mul.ds $v2,$v3,$s1
    add.d $v1,$v2,$v1
• Without chaining:
    m m m m m m m m a a a a a a a a
• With chaining:
    m m m m m m m m
      a a a a a a a a
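The timing diagrams for this example can be generated for any vector length. This sketch assumes single-cycle ops as on the slide, with one column per cycle; `timeline` and `total_cycles` are hypothetical helpers.

```python
def add_start_cycle(n: int, chained: bool) -> int:
    """Cycle at which the first add issues (the first multiply issues at 0)."""
    return 1 if chained else n


def timeline(n: int, chained: bool) -> list:
    """Two rows, one per FU: 'm' marks a multiply, 'a' marks an add."""
    mul_row = ("m " * n).rstrip()
    add_row = ("  " * add_start_cycle(n, chained) + "a " * n).rstrip()
    return [mul_row, add_row]


def total_cycles(n: int, chained: bool) -> int:
    """Cycles from the first multiply to the last add."""
    return add_start_cycle(n, chained) + n


if __name__ == "__main__":
    for chained in (False, True):
        print("chained" if chained else "unchained", total_cycles(8, chained))
        print("\n".join(timeline(8, chained)))
```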
Vector Startup Times
• For vector ops to be efficient enough to justify, startup overhead must be small
• CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs
– Result: vector performance > scalar performance for as few as four elements/vector
Cray Fortran Compiler (CFT)
• Important insight: hand-coding assembly sucks
• The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
– Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
– Exploit this: make the compiler do the heavy lifting
• CFT is pretty good for branchless inner loops
• ...but doesn't even attempt to vectorize code with IFs
– So any use of the Vector Mask register must be hand-coded
• Upshot: a good start, but not quite there
Analysis
• Extremely fast computer for 1976
• Thought experiment: what if CRAY-1's parameters scaled with Moore's Law? (32 years == 21 doublings)
– 200,000 transistors => 400 billion transistors
– 8 MB main memory => 16 TB main memory
– 80 MHz clock => ~170 THz (if only)
• For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
– I'm not the only one: it was commercially phenomenal
• However, design techniques (discrete logic) are totally unscalable
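The Moore's Law thought experiment is easy to check. This sketch applies the slide's 21 doublings to the three 1976 figures; `scale` is a hypothetical helper, and 8 MB is taken as 8 × 2^20 bytes.

```python
DOUBLINGS = 21  # 32 years at one doubling per ~18 months, per the slide


def scale(value: int, doublings: int = DOUBLINGS) -> int:
    """Value after the given number of doublings."""
    return value * 2 ** doublings


if __name__ == "__main__":
    print(f"transistors: {scale(200_000):,}")
    print(f"memory: {scale(8 * 2**20) // 2**40} TiB")
    print(f"clock: {scale(80_000_000) / 1e12:.1f} THz")
```

The transistor and memory figures come out at roughly 400 billion and 16 TB; the clock lands in the hundreds of terahertz, short of a petahertz.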
Questions?