Software Decoupled Access-Execute

Kim-Anh Tran, Konstantinos Koukos, Stefanos Kaxiras, Alexandra Jimborean
Department of Information Technology, Uppsala University, http://it.uu.se/
UART: Uppsala Architecture Research Team


DAE = Decoupling Uses from Loads

Access Phase (memory access): prefetch data into the cache.
Execute Phase (computation): use the data from the cache.

[Figure: a loop split into an Access Phase loop and an Execute Phase loop that communicate through the cache.]


Why DAE? DAE + Frequency Scaling Saves Energy

Memory-bound: run on low frequency.
Compute-bound: run on high frequency.
Original code: compromise frequency to balance between energy and performance.
DAE helps to save energy by adjusting the frequency.

[Figure: CPU frequency over execution, Original vs. DAE. Original runs at f_opt; DAE switches between f_min and f_max.]


How Can We Improve DAE?

1. Which loads to select for prefetching?
2. How to avoid address recomputation?


1. Selecting Loads: Creating Alternative Access Phases

Problem: Hoisting all loads into an Access Phase is infeasible: address computation overhead diminishes the benefits!

Create one Access Phase for each level of indirection. For an original loop that accesses x[i], a[x[i]], b[a[x[i]]], and y[i]:

    Version 1 (1-indirection), Access Phase loads: x[i], y[i]
    Version 2 (2-indirection), Access Phase loads: x[i], a[x[i]], y[i]
    Version 3 (3-indirection), Access Phase loads: x[i], a[x[i]], b[a[x[i]]], y[i]

Each Access Phase version is followed by the same Execute Phase.


Evaluating Versions at Run-time and Selecting the Best Performing One

Run-time Evaluation Phase: evaluate each combination!
Run each version (Original and the alternative Access-Execute versions) for a couple of iterations.
Once a version is identified as the best performing one, settle on this combination: pick the best combination for the remaining iterations.

Legend:
    Orig   Original Code
    A_n    n-th Access Version
    E      Execute Phase (same for all A_n)

[Figure: run-time timeline; axes: iteration (0 to 60) and loop slice (slice 0 to slice 5, ...); versions shown: Orig, A_0 E, A_1 E, A_2 E.]


2. Overcoming Address Recomputation: Hoisting Loads to Access Phases

Problem: Address recomputation in the Execute Phase is expensive.

Why DAE within one loop? By applying DAE within one loop, we may transfer data via registers: the Access Phase loads data into a register, and the Execute Phase directly uses that register.

Why unrolling? Unrolling is required because we now apply DAE within the loop.

Original loop:

    for (...) {
        L1 = ld x[i]
        L2 = ld y[i]
        L1 + L2
    }

After unroll * 2:

    for (...) {
        L1 = ld x[i]
        L2 = ld y[i]
        L1 + L2
        L3 = ld x[i+1]
        L4 = ld y[i+1]
        L3 + L4
    }

After in-place DAE (transfer of data from Access to Execute via registers):

    for (...) {
        Access Phase:
            L1 = ld x[i]
            L2 = ld y[i]
            L3 = ld x[i+1]
            L4 = ld y[i+1]
        Execute Phase:
            L1 + L2
            L3 + L4
    }


Reference: A. Jimborean, K. Koukos, V. Spiliopoulos, D. Black-Schaffer, S. Kaxiras. Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling. In Proc. of CGO'14.

Contact: [email protected]
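The in-place DAE idea from the poster (unroll the loop by 2, then group the loads ahead of their uses so that values pass from Access to Execute in registers) can be sketched by hand in C. This is only an illustrative sketch, not the authors' compiler transformation: the function names `sum_original` and `sum_inplace_dae` are hypothetical, accumulating the sums into `acc` is an assumption (the poster only shows `L1 + L2`), and whether the loaded values actually stay in registers is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop (hypothetical example): each iteration loads its two
 * operands and uses them immediately. */
long sum_original(const int *x, const int *y, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (long)x[i] + y[i];
    }
    return acc;
}

/* In-place DAE sketch: the loop is unrolled by 2, all loads of the
 * unrolled body are grouped first (Access Phase), and the uses follow
 * (Execute Phase). The locals l1..l4 play the role of the registers
 * L1..L4 on the poster. */
long sum_inplace_dae(const int *x, const int *y, size_t n) {
    assert(x != NULL && y != NULL);
    long acc = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* Access Phase: issue all loads for two iterations. */
        int l1 = x[i];
        int l2 = y[i];
        int l3 = x[i + 1];
        int l4 = y[i + 1];

        /* Execute Phase: compute on the loaded values only; no
         * address is recomputed here. */
        acc += (long)l1 + l2;
        acc += (long)l3 + l4;
    }

    /* Remainder iteration when n is odd (an assumption of this sketch;
     * the poster does not show remainder handling). */
    for (; i < n; i++) {
        acc += (long)x[i] + y[i];
    }
    return acc;
}
```

Both functions compute the same result; the second only changes the order of loads and uses inside the unrolled body, which is the property the in-place variant relies on.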