ISMS Keynote – 9 Feb 2015
A Model‐driven Approach for Time‐energy Performance of Parallel Applications
Yong Meng TEO+* and Lavanya Ramapantulu
Department of Computer Science, National University of Singapore
email: [email protected]
url: www.comp.nus.edu.sg/~teoym
+ Visiting Professor, Chinese Academy of Sciences
* Centre for Business Analytics, NUS
My Interests
Recent PhD Theses
1. Specification and Verification of Shared‐memory Concurrent Programs, Le Duy Khanh, Dec 2014.
2. Parallelism‐Energy Performance Analysis of Multicore Systems, B.M. Tudor, Jan 2014. [IPDPS 2012 PhD Forum Best Poster Award]
3. On Flash Crowd Performance of Peer‐Assisted File Distribution, C. Carbunaru, June 2014.
4. Strategy‐proof Resource Pricing in Federated Systems, M. Mihailescu, 2012. [Best Paper Award ‐ 10th International Conference on Algorithms and Architectures for Parallel Processing, May 2010]
5. Composable Simulation Models and their Formal Validation, Claudia Szabo, 2010. [ACM SIGSIM 2009 Best PhD Student Paper Award]
• Faculty of Arts and Social Sciences
• School of Business
• School of Computing
• Faculty of Dentistry
• School of Design and Environment
• Faculty of Engineering
• Faculty of Law
• Yong Loo Lin School of Medicine
• Yong Siew Toh Conservatory of Music
• Saw Swee Hock School of Public Health
• Faculty of Science
• University Scholars Programme
• Yale‐NUS College
• Lee Kuan Yew School of Public Policy
• NUS Graduate School for Integrative Sciences & Engineering
• Duke‐NUS Graduate Medical School Singapore

• Established July 1998 (formerly DISCS within FoS)
• Departments:
  – Computer Science
  – Information Systems
1. a typical Google cluster spends most of its time in the 10‐50% utilization range: a mismatch between server workload profile and server energy efficiency
2. energy‐proportional system (ideal): energy consumed is proportional to the amount of work done
ISMS 2015 Keynote
Energy‐proportional Systems
• Even when power requirements scale linearly with the load, energy efficiency is not a linear function of load; an idle system still uses 50% of its peak power
• An ideal system consumes no power when idle, very little power under a light load and, gradually, more power as the load increases
• Dynamic power range: the span between the lower and upper bounds of a device's power consumption
  – Processor (70%), DRAM (50%), disk drive (25%), network switches (15%), human (??%)
  – a wider range is better
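The first bullet can be checked numerically. The following is a minimal sketch, assuming a normalized linear power curve with an idle draw of 50% of peak; the function names and the normalization are illustrative and not taken from the talk.

```python
def power(utilization, idle_fraction=0.5):
    """Normalized power draw (1.0 = peak) under an assumed linear power model."""
    return idle_fraction + (1.0 - idle_fraction) * utilization


def efficiency(utilization, idle_fraction=0.5):
    """Work done per unit of power, normalized so that full load gives 1.0."""
    return utilization / power(utilization, idle_fraction)


if __name__ == "__main__":
    # Power grows linearly with load, but efficiency does not.
    for u in (0.1, 0.3, 0.5, 1.0):
        print(f"utilization {u:.0%}: power {power(u):.2f}, "
              f"efficiency {efficiency(u):.2f}")
```

With `idle_fraction=0.0` the same functions describe the ideal energy‐proportional system, whose efficiency is constant across all loads.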
Is a human being an energy‐proportional system?
• idle (70 W), average (120 W), peak (1‐2 kW)
• dynamic power range = 1 – 70/1000 = 93% > 90%
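The arithmetic above generalizes to any device. A small sketch, assuming dynamic power range is defined as 1 − idle/peak as in the slide's own calculation; the function name is illustrative.

```python
def dynamic_power_range(idle_watts, peak_watts):
    """Fraction of peak power that varies with load: 1 - idle/peak."""
    return 1.0 - idle_watts / peak_watts


# Human figures from the slide: 70 W idle, 1 kW taken as peak.
human = dynamic_power_range(70, 1000)

# A system that idles at half of its peak power.
half_idle_system = dynamic_power_range(50, 100)

if __name__ == "__main__":
    print(f"human: {human:.0%}, system idling at 50% of peak: {half_idle_system:.0%}")
```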
Research Questions
1. Can we replace traditional servers with low‐power nodes? [SIGMETRICS2013]
2. How do we configure energy‐efficient heterogeneous clusters (data centers)? [ICPP2014, IPDPS2015]
3. What is the cost of processing big data on low‐power servers? [VLDB2015]
4. Is dataflow a suitable model of computation and scheduling to scale out workloads on low‐power servers? [PACT2013 workshop]
General Objective
To develop models and techniques for dynamic resource provisioning to achieve energy‐efficient computing while meeting performance deadlines

Approach:
1. generalized analytic performance model for configuring application resource demand (this talk)
2. technique for runtime provisioning using polymorphic tasks
3. …..
Time‐energy Performance
L. Ramapantulu, B.M. Tudor, D. Loghin, T. Vu and Y.M. Teo, Modeling the Energy Efficiency of Heterogeneous Clusters, Proc of 43rd International Conference on Parallel Processing, Minneapolis, USA, Sep 2014.
Reducing Power: Wimpy vs Brawny Servers
[Figure: power [W] versus performance [MFLOPS] for a brawny node and a wimpy node. Annotations: marginal improvement in performance at high power; high idle power.]
Objective
How do we configure energy‐efficient heterogeneous clusters (data centers)?

Given an application with an energy budget and an execution time deadline, determine efficient configurations to run the application
Motivating Example: Configuring Heterogeneous Systems

What is the total number of possible configurations to run an application with ten AMD and ten ARM nodes?

Total = 36,380 configurations
• mixed configurations = (10 ARM nodes × 4 cores per ARM node × 5 core frequencies per ARM node) × (10 AMD nodes × 6 cores per AMD node × 3 core frequencies per AMD node) = 36,000
• AMD only = 10 × 6 × 3 = 180
• ARM only = 10 × 4 × 5 = 200
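The count can be reproduced in a few lines. This is a sketch of the arithmetic only; the function name is illustrative, and the node, core, and frequency counts are the ones given above.

```python
from itertools import product


def count_configs(nodes, cores_per_node, frequencies):
    """Number of (node count, cores used, core frequency) choices for one node type."""
    choices = product(range(1, nodes + 1),
                      range(1, cores_per_node + 1),
                      range(frequencies))
    return sum(1 for _ in choices)


arm_only = count_configs(nodes=10, cores_per_node=4, frequencies=5)
amd_only = count_configs(nodes=10, cores_per_node=6, frequencies=3)
mixed = arm_only * amd_only          # every ARM choice paired with every AMD choice
total = mixed + arm_only + amd_only

if __name__ == "__main__":
    print(arm_only, amd_only, mixed, total)
```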
Contributions
• Model‐driven approach: measurement‐based analytical model to determine energy‐efficient configurations on a mix of heterogeneous nodes
  – Meets a time deadline with minimum energy
• Our analysis shows that the energy‐deadline Pareto frontier consisting of heterogeneous mixes is almost always more energy‐efficient than homogeneous clusters
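To make the notion of an energy‐deadline Pareto frontier concrete, here is a minimal sketch. It assumes each configuration has already been reduced to a (time, energy, label) tuple; the sample values and labels are hypothetical and are not results from the talk.

```python
def pareto_frontier(configs):
    """Return configurations not dominated in both time and energy, sorted by time.

    configs: iterable of (time_s, energy_j, label) tuples.
    """
    frontier = []
    best_energy = float("inf")
    for time_s, energy_j, label in sorted(configs, key=lambda c: (c[0], c[1])):
        # Keep a configuration only if it uses less energy than every faster one.
        if energy_j < best_energy:
            frontier.append((time_s, energy_j, label))
            best_energy = energy_j
    return frontier


def min_energy_within_deadline(configs, deadline_s):
    """Lowest-energy configuration that finishes by the deadline, or None."""
    feasible = [c for c in configs if c[0] <= deadline_s]
    return min(feasible, key=lambda c: c[1]) if feasible else None


# Hypothetical sample data, for illustration only.
sample = [
    (10.0, 500.0, "AMD only"),
    (40.0, 200.0, "ARM only"),
    (15.0, 300.0, "mix A"),
    (20.0, 350.0, "mix B"),
]

if __name__ == "__main__":
    print(pareto_frontier(sample))
    print(min_energy_within_deadline(sample, deadline_s=20.0))
```

Sorting by time first means a single pass suffices: a configuration is on the frontier exactly when it uses less energy than everything at least as fast.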
Model‐driven Approach
[Figure: model‐driven approach. Labels: Applications (workload parameters); Heterogeneous Systems (system parameters); Non‐intrusive Baseline Execution; baseline measurement; Time‐Energy Performance Model; energy‐efficient Pareto‐optimal configurations.]

The model:
• considers different ISAs
• resource overlap
• unifying unit of work
Applications
Domain              Program         Problem Size
HPC                 EP              2,147,483,648 random numbers
Web server          memcached       600,000 GET/SET operations
Streaming video     x264            600 frames, 704 × 576
Financial           Black‐scholes   500,000 stock options
Speech recognition  Julius          2,310,559 samples
Web security        RSA‐2048        5,000 key verifications
Heterogeneous System
• ARM v7‐A Cortex‐A9
• quad‐core, 0.2 to 1.4 GHz

• AMD K10, x86_64
• six‐core, 0.8 to 2.1 GHz
Baseline Execution
• Measurements needed only for a single node, for each type of node
  – non‐intrusive hardware performance counters
• Execute the program for a very small problem size
  – measure instructions, computation cycles and stall cycles
  – e.g. measure instructions per GET operation of memcached
• Execute micro‐benchmarks to measure active and stall power of processor cores
Execution Time Model
Parallel application on $n_{ARM}$ ARM nodes and $n_{AMD}$ AMD nodes

• match the execution rates between ARM and AMD nodes: $T(n_{ARM}) \approx T(n_{AMD})$
• within a type of node, the workload is equally divided: $T(n_{ARM}) \approx T(1) / n_{ARM}$
• $T(1) \approx \max(T_{CPU}, T_{I/O})$ [CPU and I/O overlap]
Execution Time Model
$T_{CPU} \approx T_{core,work} + T_{core,stall}$
$T_{core,work} \approx$ …
$T_{core,stall} \approx$ …

• stall cycles increase linearly with
  – increase in core clock frequency
  – increase in the number of cores
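The node‐level part of the execution time model can be sketched as follows. This is a minimal illustration, assuming the single‐node time is the maximum of CPU and I/O time (full overlap), that work divides evenly within a node type, and that work is split between node types in proportion to their aggregate rates so that both groups finish together. The function names and the per‐unit times in the usage example are hypothetical.

```python
def single_node_time(t_cpu, t_io):
    """Time on one node when CPU work and I/O overlap."""
    return max(t_cpu, t_io)


def group_time(t_single, n_nodes):
    """Time for a group of identical nodes sharing the work equally."""
    return t_single / n_nodes


def split_work(total_units, t_unit_arm, n_arm, t_unit_amd, n_amd):
    """Split work so the ARM and AMD groups finish at about the same time.

    t_unit_*: seconds one node of that type needs per unit of work.
    Returns (arm_units, amd_units).
    """
    rate_arm = n_arm / t_unit_arm   # units per second for the whole ARM group
    rate_amd = n_amd / t_unit_amd   # units per second for the whole AMD group
    arm_units = total_units * rate_arm / (rate_arm + rate_amd)
    return arm_units, total_units - arm_units


if __name__ == "__main__":
    # Hypothetical per-unit times (seconds): ARM is CPU-bound, AMD is I/O-bound.
    t_arm = single_node_time(t_cpu=0.004, t_io=0.001)
    t_amd = single_node_time(t_cpu=0.0005, t_io=0.001)
    arm_units, amd_units = split_work(1000, t_arm, 10, t_amd, 10)
    print(group_time(arm_units * t_arm, 10), group_time(amd_units * t_amd, 10))
```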
Stalls due to Memory Contention
Energy Model
• Total energy $= E_{ARM} \times n_{ARM} + E_{AMD} \times n_{AMD}$
• $E_{node} = E_{core} + E_{mem} + E_{I/O} + E_{idle}$
• $E_{core} = P_{core,act} \times T_{core,work} + P_{core,stall} \times T_{core,stall}$
  – power × time
  – uses the execution time model
  – measured values for $P_{core,act}$, $P_{core,stall}$, $P_{I/O}$
  – $P_{mem,act}$, $P_{mem,stall}$ for ARM and AMD from literature and specifications
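A minimal sketch of how the energy terms compose. The core term follows the formula above; modelling the memory, I/O, and idle terms as a single power multiplied by a time is an assumption made here for illustration (the slide distinguishes active and stall memory power), and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeProfile:
    """Per-node power figures in watts (hypothetical field names)."""
    p_core_act: float    # core power while doing useful work
    p_core_stall: float  # core power while stalled
    p_mem: float         # memory power (assumed single value)
    p_io: float          # I/O power
    p_idle: float        # idle power of the rest of the node


def core_energy(p, t_work, t_stall):
    """E(core) = P_core,act * T_core,work + P_core,stall * T_core,stall."""
    return p.p_core_act * t_work + p.p_core_stall * t_stall


def node_energy(p, t_work, t_stall, t_mem, t_io, t_total):
    """E_node = E(core) + E(mem) + E(I/O) + E(idle), each taken as power * time."""
    return (core_energy(p, t_work, t_stall)
            + p.p_mem * t_mem      # assumption: memory energy as power * time
            + p.p_io * t_io        # assumption: I/O energy as power * time
            + p.p_idle * t_total)  # assumption: idle power drawn for the whole run


def total_energy(e_arm, n_arm, e_amd, n_amd):
    """Total energy = E_ARM * n_ARM + E_AMD * n_AMD."""
    return e_arm * n_arm + e_amd * n_amd


if __name__ == "__main__":
    arm = NodeProfile(p_core_act=2.0, p_core_stall=1.0, p_mem=0.5, p_io=0.25, p_idle=1.0)
    e_arm = node_energy(arm, t_work=3.0, t_stall=2.0, t_mem=2.0, t_io=4.0, t_total=5.0)
    print(e_arm, total_energy(e_arm, 2, 30.0, 1))
```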
1. Is heterogeneity better than homogeneity?
2. Are larger mixes of heterogeneous nodes better?
3. …
Heterogeneity versus Homogeneity
[Figures: heterogeneity versus homogeneity; plot label: (36,380) configurations]
Heterogeneity
• Enables a sweet region
• Saves more energy for a given deadline
Are larger mixes better?

• Larger mixes are more energy efficient
• They enable a larger number of “sweet spots”
Observations
1. Heterogeneity allows larger energy savings compared to homogeneous systems.
2. Larger mixes increase the number of configurations in the sweet region.
3. …
Conclusions
• measurement‐driven analytical model to determine energy‐efficient configurations for a single workload on a heterogeneous mix with different ISAs
• heterogeneity is almost always more energy‐efficient than homogeneity
  – but not for programs with a large sequential fraction and high parallel overhead

L. Ramapantulu, B.M. Tudor, D. Loghin, T. Vu and Y.M. Teo, Modeling the Energy Efficiency of Heterogeneous Clusters, Proceedings of 43rd International Conference on Parallel Processing, Minneapolis, USA, Sep 2014.
Heterogeneous Low‐power Systems
1. Nov 2014 – 12‐node Heterogeneous CPU‐GPU Cluster (JetsonTK1) with 44 ARM cores & 2,304 GPU cores
1. D. Loghin, B.M. Tudor, H. Zhang, B.C. Ooi and Y.M. Teo, A Performance Study of Big Data on Small Nodes, Proc of 41st International Conference on Very Large Data Bases, Vol. 8, No. 7, Hawaii, USA, Aug 31‐Sep 4, 2015.
2. L. Ramapantulu, D. Loghin and Y.M. Teo, An Approach for Energy Efficient Execution of Hybrid Parallel Programs, Proceedings of 29th IEEE International Parallel & Distributed Processing Symposium, Hyderabad, India, May 25‐29, 2015 (acceptance 22%).
3. D. Loghin, B.M. Tudor and Y.M. Teo, An Approach for Direct Dataflow Execution on Contemporary Multicore Systems, Proc of 3rd International Workshop on Dataflow Execution Models for Extreme Scale Computing, IEEE Computer Society Press, in conjunction with PACT2013, Edinburgh, Scotland, Sep 2013.
4. B.M. Tudor and Y.M. Teo, On Understanding the Energy Consumption of ARM‐based Multicore Servers, ACM SIGMETRICS, Carnegie Mellon University, Pittsburgh, USA, June 17‐21, 2013 [acceptance: 27 of 196] (featured article in HPCwire: Mapping the Energy Envelope of Multicore ARM Chips, 6 June 2013).
5. B.M. Tudor and Y.M. Teo, Towards Modeling Parallelism and Energy Performance of Multicore Systems, Proc of 26th IEEE International Parallel & Distributed Processing Symposium, Shanghai, China, May 21‐25, 2012. [PhD Forum Best Poster Award]
6. B. Tudor, Y.M. Teo and S. See, Understanding Off‐chip Contention of Parallel Programs in Chip Multiprocessors, Proc. of 40th International Conference on Parallel Processing, Taipei, Taiwan, Sep 2011 (acceptance 19%).