www-inst.eecs.berkeley.edu/~cs152/. CS 152 Computer Architecture and Engineering. Lecture 22 -- GPU + SIMD + Vectors I. 2014-4-15 John Lazzaro (not a prof - “John” is always OK). TA: Eric Love. Today: Architecture for data parallelism. The Landscape: Three chips that deliver
256-bit version -> double-precision vectors of length 4
Exception Model
MXCSR: AVX condition codes register
Floating-point exceptions: Always a contentious issue in ISA design ...
Exception Handling
Use MXCSR to configure AVX to halt the program for divide by zero, etc ...
Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN, ...
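The two policies above can be sketched with a small scalar model. This is not AVX code: `vdiv`, `halt_on_error`, and `VectorFPError` are hypothetical names made up for illustration, and the flag only stands in for the exception mask bits in MXCSR. Real hardware raises the exception for the whole instruction, not lane by lane.

```python
import math


class VectorFPError(ArithmeticError):
    """Stands in for the trap taken when an exception is unmasked (hypothetical)."""


def vdiv(a, b, halt_on_error=False):
    """Element-wise divide of two equal-length lists of floats.

    halt_on_error=True  models "halt the program" on divide by zero.
    halt_on_error=False models "show must go on": fill in +Inf, -Inf or NaN.
    """
    out = []
    for lane, (x, y) in enumerate(zip(a, b)):
        if y == 0.0:
            if halt_on_error:
                raise VectorFPError(f"divide by zero in lane {lane}")
            if x == 0.0 or math.isnan(x):
                out.append(math.nan)  # 0/0 has no meaningful value
            else:
                # Sign of the infinity follows the signs of both operands.
                sign = math.copysign(1.0, x) * math.copysign(1.0, y)
                out.append(math.copysign(math.inf, sign))
        else:
            out.append(x / y)
    return out


# "Show must go on": the bad lanes are filled in and the good lane survives.
print(vdiv([1.0, -1.0, 0.0, 8.0], [0.0, 0.0, 0.0, 2.0]))
```

With `halt_on_error=True` the same call raises `VectorFPError` at the first offending lane, which is the software-visible difference between the two configurations.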
Data moves
AVX register file reads pass through permute and shuffle networks in both “X” and “Y” dimensions.
Many AVX instructions rely on this feature ...
Pure data move opcode. Or, part of a math opcode.
Permutes over 2 sets of 4 fields of one vector.
Arbitrary data alignment.
Shuffling two vectors.
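A rough scalar model of the two data-move patterns named above, with Python lists standing in for vector registers. The function names and the selector encoding are assumptions made for illustration. They do not reproduce the immediate or control-register encoding of any particular AVX opcode.

```python
def permute_in_lanes(v, sel, lane_width=4):
    """Permute fields inside each lane of one vector.

    sel[i] is the field, relative to the start of the lane, that lands in
    position i of that lane. The same selector is applied to every lane.
    """
    if len(v) % lane_width != 0 or len(sel) != lane_width:
        raise ValueError("vector length and selector must match lane_width")
    out = []
    for base in range(0, len(v), lane_width):
        out.extend(v[base + s] for s in sel)
    return out


def shuffle_two(a, b, picks):
    """Build one result from two vectors.

    Each pick is (source, index): source 0 reads from a, source 1 from b.
    """
    sources = (a, b)
    return [sources[which][idx] for which, idx in picks]


# 2 sets of 4 fields of one vector, each set reversed.
print(permute_in_lanes([0, 1, 2, 3, 4, 5, 6, 7], [3, 2, 1, 0]))
# Interleave the low halves of two vectors.
print(shuffle_two([10, 11, 12, 13], [20, 21, 22, 23],
                  [(0, 0), (1, 0), (0, 1), (1, 1)]))
```

The first call prints `[3, 2, 1, 0, 7, 6, 5, 4]` and the second prints `[10, 20, 11, 21]`.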
Memory System
Gather: Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads.
Values in memory. Specified indices.
Final result.
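A gather can be modeled in a few lines. The sketch below is only a behavioral model: `gather` and `line_elems` are hypothetical names, and treating "minimizing redundant reads" as "fetch each cache line at most once" is an assumption about what the hardware does, not a description of the actual implementation.

```python
def gather(memory, indices, line_elems=8):
    """Read memory[i] for each i in indices, in the order given.

    Returns (result, lines_fetched). Each group of line_elems consecutive
    elements models one cache line, and a line is fetched at most once
    no matter how many indices fall inside it.
    """
    lines = {}  # line number -> elements fetched for that line
    result = []
    for idx in indices:
        line = idx // line_elems
        if line not in lines:
            start = line * line_elems
            lines[line] = memory[start:start + line_elems]
        result.append(lines[line][idx % line_elems])
    return result, len(lines)


memory = list(range(100, 132))   # values in memory
indices = [3, 17, 1, 17]         # specified indices, non-unit stride
print(gather(memory, indices))   # final result, plus lines fetched
```

Here four elements are gathered but only two lines are fetched, because indices 3 and 1 share a line and index 17 repeats. The call prints `([103, 117, 101, 117], 2)`.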
Positive observations ...
Best for applications that are a good fit for Xeon’s memory system: large on-chip caches, up to a TeraByte of DRAM, but only moderate bandwidth requirements to DRAM.
Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching.
If you’re planning on experimenting with GPUs, you need a Xeon server anyway ... aside from $$$, why not buy a high-core-count variant?
Negative observations ...
AVX changes each generation, in a backward-compatible way, to add the latest features.
AVX is difficult for compilers. Ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want.
Two FMA units per core (50% of issue width) is probably the limit. So, scaling vector size or scaling core count are the only upgrade paths.