Slide 1

Slide 2

Slide 3
Using FMA everywhere hurts performance
Cool one: Fused multiply accumulate (FMA)

Slide 4
//... stuff...
x[0] = y[0]; // 128b copy
x[1] = y[1]; //…
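
Side note on the FMA slide: a minimal sketch of what "fused" means numerically, not taken from the slides. A fused multiply-accumulate computes a*b + c with a single rounding, whereas a separate multiply and add rounds twice. The helper names `mul_then_add` and `fused` are hypothetical, chosen only for this illustration, and the sketch says nothing about the performance claim on the slide.

```cpp
#include <cmath>

// Hypothetical helper: multiply, round, then add and round again.
// The volatile store forces the product to be rounded to double, so the
// compiler cannot contract the expression into a single FMA instruction.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Hypothetical helper: fused multiply-add via the standard library.
// std::fma computes a*b + c as if to infinite precision, then rounds once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 in double precision. `mul_then_add` therefore returns 0, while `fused` keeps the -2^-54 term.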