
This is a repository copy of Impact of Memory Frequency Scaling on User-centric Smartphone Workloads.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/125334/

Version: Accepted Version

Proceedings Paper: Mendis, Hashan Roshantha, Chen, Wei-Ming, Soares Indrusiak, Leandro orcid.org/0000-0002-9938-2920 et al. (2 more authors) (2018) Impact of Memory Frequency Scaling on User-centric Smartphone Workloads. In: Proceedings of the 33rd ACM/SIGAPP Symposium on Applied Computing (SAC 2018).


Impact of Memory Frequency Scaling on User-centric Smartphone Workloads

Hashan R. Mendis (1), Wei-Ming Chen (2), Leandro Soares Indrusiak (1), Tei-Wei Kuo (2), Pi-Chen Hsiu (3)

(1) Real-time Systems Group, University of York, UK

(2) Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan

(3) Research Center for Information Technology Innovation, Academia Sinica, Taiwan

ABSTRACT

Improving battery life in mobile phones has become a top concern with the increase in memory and computing requirements of applications with tough quality-of-service needs. Many energy-efficient mobile solutions vary the CPU and GPU voltage/frequency to save power. However, energy-aware control over the memory bus connecting the various on-chip subsystems has received much less interest. This measurement-based study first analyses the CPU, GPU and memory cost (i.e. product of utilisation and frequency) of user-centric smartphone workloads. The impact of memory frequency scaling on power consumption and quality-of-service is also measured. We also present a preliminary analysis of the frequency levels selected by the different default governors of the CPU/GPU/memory components. We show that an interdependency exists between the CPU and memory governors and that it may cause an unnecessary increase in power consumption, due to interference with the CPU frequency governor. The observations made in this measurement-based study can also reveal some design insights to system designers.

1 Introduction

As smartphone applications become enriched with newer features and smoother user experience, their computing and memory requirements also increase, leading to higher power consumption. Therefore, energy-efficient personal computing without significant impact on user experience has become a top research concern. From a hardware perspective, modern smartphone system-on-chips (SoCs) can contain multiple dedicated IPs and a heterogeneous power-efficient multi-core mobile processor (e.g. ARM big.LITTLE processor [1]). From a software perspective, OS-driven power saving techniques such as dynamic voltage and frequency scaling (DVFS) are commonly used, where the operating frequency of compute cores is scaled based on a performance metric such as system load.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ACM SAC '18, April 9-13, 2018, Pau, France

©2017 ACM.

The communication between on-chip IP components takes place via one or more internal buses and shared main memory. Hence, the energy consumed by the memory subsystem and bus fabric can also be significant, yet it has been given limited attention. As smartphones are increasingly used for media-rich, high-memory-throughput applications, memory bus related power consumption can outweigh that of the CPU and GPU [4]. Therefore, several recent works have started focusing on memory and bus power saving techniques (e.g. [3] [6]).

This work investigates power consumption, on-chip resource usage characteristics and quality-of-service (QoS) related to memory bus frequency scaling, specifically for user-centric smartphone workloads. The overall contributions made in this paper are as follows:

• We characterise several common smartphone macro-workloads in terms of CPU, GPU and memory bus usage and show how different application workloads can result in varying resource usage.

• We investigate the effect on CPU and GPU usage of changing the memory bus frequency.

• We analyse the interplay between CPU and memory bus frequency scaling and give examples of how unnecessary energy dissipation can occur.

• From our observations, useful design suggestions are drawn, which can help system designers to efficiently balance power consumption and QoS.

2 Related work

2.1 Smartphone workload characterisation

In [5], it is shown that in certain micro-workloads that have a high memory footprint, RAM power consumption can exceed CPU power by a small margin. Their subsequent work shows that the MIF (memory interface) and INT (internal) buses can use more energy than the RAM, CPU or GPU in idle states as well as in interactive scenarios (e.g. gaming) [4]. Pandiyan and Wu [18] show that 28-40% of total mobile system power consumption can be due to data movement across the memory hierarchy, especially in the case of applications with high cache miss rates (e.g. HD video playback).

Initial studies by Gutierrez et al. [9] show that standard micro-benchmarks such as SPEC2000 do not accurately reflect the resource usage of user-centric interactive smartphone applications. Gao et al. [8] observe that offloading applications with high thread/data-level parallelism onto GPUs or accelerators can offer better energy efficiency. The speed of user interaction (e.g. scrolling speed on a web browser) can also impact the amount of CPU/GPU usage and the respective memory bandwidth used by the application, as well as cause significant fluctuations in power consumption [20].

2.2 Smartphone power management

The Linux ondemand CPU frequency governor is a popular kernel-level entity that increases the clock frequency to the maximum when the CPU load is above a pre-defined threshold and decreases the frequency gradually when the CPU load is below a lower threshold [17]. However, load-dependent frequency governors often raise the frequency higher than required by the target QoS, thus wasting power [12].
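The threshold behaviour described above can be sketched in a few lines. This is only an illustration of the paragraph's description, not the kernel's actual ondemand implementation: the function names, the 80%/30% thresholds and the one-level-per-sample step-down are assumptions made for the example.

```python
from typing import List, Sequence


def ondemand_like_step(load: float, cur_idx: int, n_levels: int,
                       up_threshold: float = 0.80,
                       down_threshold: float = 0.30) -> int:
    """Pick the next frequency-level index for one sampling period.

    Thresholds are hypothetical values chosen for illustration only.
    """
    if load > up_threshold:
        return n_levels - 1      # load is high: jump straight to the maximum
    if load < down_threshold and cur_idx > 0:
        return cur_idx - 1       # load is low: step down gradually
    return cur_idx               # otherwise hold the current level


def run_governor(loads: Sequence[float], n_levels: int,
                 start_idx: int = 0) -> List[int]:
    """Apply the rule to a sequence of sampled loads; return chosen indices."""
    idx = start_idx
    trace = []
    for load in loads:
        idx = ondemand_like_step(load, idx, n_levels)
        trace.append(idx)
    return trace


if __name__ == "__main__":
    # Five frequency levels (indices 0..4), loads sampled once per period.
    print(run_governor([0.9, 0.5, 0.1, 0.1, 0.95], n_levels=5))
```

Because the rule reacts only to load, any effect that inflates the sampled load (such as stalls on a slower memory bus) pushes the frequency up whether or not the application needs it.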

Co-operative, runtime CPU and GPU frequency scaling has been shown to reduce power consumption by 58% (at the desired QoS level) over the default ondemand governor, for mobile 3D games [19]. Similarly, Chen et al. [7] identify CPU-bound and GPU-bound phases of a mobile 3D game to set the CPU-GPU frequencies according to the game phase and user interactivity. They are able to outperform previous work [19] for highly interactive 3D games with frequent phase switching. The work in [19] has also been extended to incorporate off-chip memory access time and access rate (with a fixed memory frequency) into the performance-cost model [11].

Decreasing the memory frequency essentially increases the amount of time the CPU has to wait for memory fetches and thereby increases the CPU utilisation and decreases the GPU utilisation [19]. Therefore, there is a risk that the overall system power can still increase if the combined power reduction of the GPU and memory is not sufficient to outweigh the power increase of the CPU. Our study aims to investigate whether this CPU-GPU-memory relationship can also be seen for non-3D-gaming workloads.

Memory-aware DVFS has been explored by Chaudhary et al. [6], who use DMA and processor hardware performance counters (HPC) to predict the required bus bandwidth of the system and improve bandwidth utilisation. Nachiappan et al. [16] explore several greedy workload-aware multi-component (CPU/memory/other IPs) frequency selection policies that use application slack as a metric. Begum et al. [3] demonstrate that relaxing the bounds on the performance-energy trade-off can lead to lower CPU/memory frequency tuning overhead, and that obtaining optimal multi-component frequencies is complex due to their interplay. In their most recent work, runtime predictive algorithms are used to tune the CPU and DRAM frequencies to stay below an inefficiency (wasted energy) budget [2]. However, note that unlike our study, several of the works discussed above (e.g. [6], [3], [2]) use non-interactive workloads. Furthermore, they do not investigate the interplay between the CPU and memory bus frequency governors.

3 Problem formulation

As outlined in Section 2, the memory controller and memory bus power consumption can be a significant contributor to the overall system power, especially in memory-bound applications. Newer mobile chipsets and kernel drivers are now equipped with functionality to control the memory bus frequency. Even though energy-efficient CPU-GPU frequency governing has been extensively explored, the impact of memory bus frequency scaling has not been investigated in detail. Whilst work such as [19] and [11] finds memory DVFS unattractive for saving power due to excessive CPU stalling, recent work in [2] and [6] demonstrates the benefits of controlling the memory frequency. Therefore, this study aims to answer the following research questions:

• Does reducing the memory frequency help to reduce the power consumption of mobile macro-workloads? If so, is there a performance/QoS penalty?

• Is there a relationship between the CPU and memory frequencies?

4 Experimental design

4.1 Experimental platform

All measurements were conducted on a Samsung Galaxy S4 (i9500, Android 4.2.2) device with an Exynos 5410 chipset (8-core/2-clusters: 4x1.6GHz Cortex-A15 & 4x1.2GHz Cortex-A7). It has 2GB RAM (dual-channel 800 MHz LPDDR3 - 12.8 GB/s), a PowerVR SGX544 MP3 GPU (532MHz max GPU clock) and a 1080p WQXGA display (brightness fixed at 33%). Dedicated image signal processors (ISP) and hardware video codec IPs are also included in the SoC.
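As a sanity check on the quoted memory figures, the 12.8 GB/s peak bandwidth is consistent with the back-of-envelope calculation below. It assumes 32-bit (4-byte) wide LPDDR3 channels and two transfers per clock cycle (double data rate); neither assumption is stated in the text.

```latex
\[
BW_{\text{peak}}
  = \underbrace{2}_{\text{channels}}
    \times \underbrace{800 \times 10^{6}}_{\text{clock (Hz)}}
    \times \underbrace{2}_{\text{transfers/clock}}
    \times \underbrace{4}_{\text{bytes/transfer}}
  = 12.8 \times 10^{9}\ \text{B/s}
\]
```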

The Monsoon power monitor device [15] was used to measure total device power consumption, with a 5 kHz sampling rate. An Android native program (written in C) was developed to collect periodic runtime CPU/GPU/memory performance statistics at 200 ms intervals. Monitoring overhead was less than 1% of the total system load. USB charging interferes with power measurement, so a Wi-Fi-based Android Debug Bridge (ADB) connection was used to control the device. Internet traffic and unnecessary background services were disabled. The RepetiTouch application was used to record and replay touch events to reliably reproduce the workload against different treatments. Runtime frame rate, as measured by the SurfaceFlinger service, was captured using the dumpsys system command (called only once per second).

4.1.1 Measuring CPU/GPU utilisation and frequency

The ondemand [17] CPU governor is used for all primary experiments; 17 distinct CPU frequency levels are available. The /proc/stat file was read to calculate per-core and total CPU utilisation (i.e. busy time/total time). A vendor-specific GPU frequency governor is enabled, which uses the GPU utilisation to select an appropriate frequency from 5 different levels. The runtime GPU utilisation and frequency are read from the sysfs interface.
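The busy time/total time calculation can be illustrated with a short Python sketch that compares two snapshots of /proc/stat. This is not the authors' C monitoring program: the function names are hypothetical, and treating iowait as idle time and summing only the first eight counters (user to steal) are assumptions of this example.

```python
from typing import Dict, Tuple


def parse_proc_stat(text: str) -> Dict[str, Tuple[int, int]]:
    """Map 'cpu', 'cpu0', ... to cumulative (busy, total) jiffies."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        # Column order: user nice system idle iowait irq softirq steal ...
        values = [int(v) for v in fields[1:9]]
        if len(values) < 4:
            continue
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        counters[fields[0]] = (total - idle, total)
    return counters


def utilisation(before: Dict[str, Tuple[int, int]],
                after: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Busy/total ratio per CPU entry between two snapshots."""
    util = {}
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue             # core was offline in the first snapshot
        busy0, total0 = before[name]
        d_total = total1 - total0
        util[name] = (busy1 - busy0) / d_total if d_total > 0 else 0.0
    return util


if __name__ == "__main__":
    # Synthetic snapshots for illustration (not measured data).
    snap0 = "cpu  100 0 50 850 0 0 0 0\ncpu0 100 0 50 850 0 0 0 0\n"
    snap1 = "cpu  160 0 90 950 0 0 0 0\ncpu0 160 0 90 950 0 0 0 0\n"
    print(utilisation(parse_proc_stat(snap0), parse_proc_stat(snap1)))
```

On a device, the two snapshots would be the contents of /proc/stat read one sampling interval apart.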

4.1.2 Measuring and controlling the memory bus

The Exynos 5410 chipset has two memory bus groups with associated voltage regulators and vendor-specific drivers to control the MIF/INT bus frequencies. The MIF bus regulator controls the power to the bus connected to the main memory interfaces. The INT bus regulator provides power to the buses connected to the various internal IPs (e.g. signal processors, hardware accelerators, display controller etc.). The drivers use PPMUs (Platform Performance Monitoring Units) to monitor system load/usage. The behaviour of the default memory bus governor is similar to the conservative governor, in that it increases/decreases the bus frequency based on the utilisation. The max/min frequency can be adjusted at runtime (Listing 1) to fix the bus frequency at a specific level. There are 4 MIF bus frequency levels (max. 800MHz, min. 100MHz) and 11 INT bus frequency levels (max. 800MHz, min. 50MHz) provided. A user-space interface provided by the vendor can be used to read the bus bandwidth and utilisation, as shown in Listing 1. In 2011, Samsung introduced the above device frequency governing functionality into the mainline Linux kernel [10].

Listing 1: Exynos5410 memory bus frequency control and monitoring

    * Enable kernel compilation flags:
    CONFIG_SAMSUNG_NOCPMONITOR=y
    CONFIG_SAMSUNG_BWMONITOR=y
    * Enable and view bus bandwidth monitor:
    echo 1 2 > /sys/devices/platform/exynos5-busfreq-int/devfreq/exynos5-busfreq-int/bw_monitor
    cat /sys/kernel/debug/bw_monitor
    * Set MIF/INT frequency:
    echo 800000 > /sys/devices/platform/exynos5-busfreq-[mif/int]/devfreq/exynos5-busfreq-[mif/int]/max_freq
    echo 800000 > /sys/devices/platform/exynos5-busfreq-[mif/int]/devfreq/exynos5-busfreq-[mif/int]/min_freq
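When scripting experiments over several MIF/INT settings, the sysfs writes of Listing 1 could be wrapped in a small helper. The sketch below is an illustration under assumptions, not the authors' tooling: the function names and the `root` parameter are hypothetical, the `max_freq`/`min_freq` attribute names and directory layout follow Listing 1, and the value is assumed to be in kHz (800000 for 800 MHz).

```python
import os

SYSFS_ROOT = "/sys/devices/platform"   # location used in Listing 1


def busfreq_dir(bus: str, root: str = SYSFS_ROOT) -> str:
    """Return the devfreq directory for the 'mif' or 'int' bus."""
    if bus not in ("mif", "int"):
        raise ValueError("bus must be 'mif' or 'int'")
    name = "exynos5-busfreq-" + bus
    return os.path.join(root, name, "devfreq", name)


def pin_bus_frequency(bus: str, khz: int, root: str = SYSFS_ROOT) -> None:
    """Fix a bus at one level by writing the same value to max and min.

    Mirrors the write order of Listing 1 (max_freq first, then min_freq).
    A driver may reject a write that would leave max below min, so a real
    script might need to choose the order based on the current limits.
    """
    directory = busfreq_dir(bus, root)
    for knob in ("max_freq", "min_freq"):
        with open(os.path.join(directory, knob), "w") as handle:
            handle.write("%d\n" % khz)


if __name__ == "__main__":
    # Requires root on the device; pins MIF at 400 MHz and INT at 200 MHz.
    pin_bus_frequency("mif", 400000)
    pin_bus_frequency("int", 200000)
```

Passing a different `root` makes the helper testable against an ordinary directory tree, without touching a real device.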

4.2 Workloads

We primarily measure user-centric macro-workloads, as shown in Table 1, that represent popular real-world smartphone use-cases and applications (e.g. web browsing, typing on an instant messaging application, scrolling on a newsfeed or social media timeline etc.) [8] [20]. Several background workloads (e.g. ffmpeg0, music0, ftp0) are also chosen to analyse common non-GPU/non-interactive workloads.

The CPU and GPU share the system memory bandwidth. Therefore, we also use micro-workloads in this work (Table 1, bottom), primarily to investigate the CPU-memory frequency scaling interplay without interference from the GPU or other on-chip hardware components. These micro-workloads do not use the GPU and have been designed to independently stress the CPU/memory subsystems [13]. Note also that the micro-workloads have different runtimes, with micro2 and micro0 having the shortest and longest execution times respectively.

4.3 Scenarios

The macro-workloads are run at different MIF/INT bus frequency levels; however, certain SoC components require a minimum MIF/INT frequency level for stable operation (e.g. display: MIF>200MHz, camera ISPs: MIF>=800MHz, INT>=600MHz). For brevity, the MIF/INT frequencies are denoted as a frequency pair (e.g. {800MHz, 800MHz}). Each macro-workload was tested under at least 3 distinct MIF/INT frequency levels as well as under the default MIF/INT frequency governor. In all macro-workloads, the CPU and GPU frequency governors were set to ondemand [17] and the vendor-specific implementation respectively.

The micro-workloads were tested under the frequency settings shown in Table 2, to inspect the CPU frequency scaling as the MIF/INT frequencies are changed. In randMIF and randINT, the MIF/INT frequencies are independently varied (randomly selected from 800/400/200 MHz) every 0.5 seconds, and in randMem they are jointly varied.

4.4 Metrics

For the macro-workloads, we measure the power consumption, the QoS (frame rate) of foreground applications and the latency of background applications. QoS metrics such as frame rate and latency give us an indication of responsiveness. We intend to analyse the impact on power, QoS and latency of varying the MIF/INT bus frequencies.

System-level statistics such as memory bus saturation, bus bandwidth, dynamic CPU/GPU/MIF/INT frequency level and CPU/GPU utilisation are gathered. Memory bus saturation (i.e. utilisation) and CPU/GPU utilisation are calculated as the ratio of the busy clock cycles to the total clock cycles in a sampling period. The CPU utilisation metric represents the average utilisation across all CPU cores. There exists an interdependency between the CPU/GPU utilisation and the respective frequencies assigned by the frequency governors. Pathania et al. [19] introduce a unified CPU/GPU cost metric (cost = frequency × utilisation) to understand the overall work done by the CPU/GPU. We measure and calculate the CPU and GPU cost similarly to [19], as well as the memory cost = memory utilisation × MIF frequency × INT frequency.
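The cost metrics defined above are simple products, as the following sketch shows. The function names are hypothetical. Expressing utilisation as a fraction in [0, 1] and frequency in MHz is an assumption of the example, and the `cost_percent` helper (scaling by the maximum frequency to get a 0-100 value) is an illustrative addition rather than something stated in the text.

```python
def component_cost(utilisation: float, freq_mhz: float) -> float:
    """CPU or GPU cost: frequency x utilisation."""
    return utilisation * freq_mhz


def memory_cost(mem_utilisation: float, mif_mhz: float,
                int_mhz: float) -> float:
    """Memory cost: memory utilisation x MIF frequency x INT frequency."""
    return mem_utilisation * mif_mhz * int_mhz


def cost_percent(utilisation: float, freq_mhz: float,
                 max_freq_mhz: float) -> float:
    """Hypothetical helper: cost as a percentage of the maximum possible."""
    return 100.0 * component_cost(utilisation, freq_mhz) / max_freq_mhz


if __name__ == "__main__":
    # A core that is 50% busy at 1600 MHz and one that is 100% busy at
    # 800 MHz have the same cost, which is the point of the metric.
    print(component_cost(0.5, 1600), component_cost(1.0, 800))
    print(memory_cost(0.25, 800, 400))
```

Multiplying by frequency separates "busy because there is much work" from "busy because the clock is slow", which utilisation alone cannot do.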

For the micro-workloads, we measure system-level performance metrics similar to those of the macro-workloads. We also measure the mean and sum of all CPU frequencies and the number of CPU frequency transitions over the workload runtime. The sampling rate of measurements was increased (50 ms interval) to improve accuracy.
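Given a trace of CPU frequencies sampled at a fixed interval, the three micro-workload metrics can be computed as sketched below. The function name and the choice of kHz as the unit are assumptions. A transition is counted whenever two consecutive samples differ, so changes that occur and revert within one sampling interval are not seen.

```python
from typing import Dict, Sequence


def frequency_metrics(samples_khz: Sequence[int]) -> Dict[str, float]:
    """Number of transitions, sum and mean of a sampled frequency trace."""
    if not samples_khz:
        raise ValueError("at least one sample is required")
    transitions = sum(
        1 for prev, cur in zip(samples_khz, samples_khz[1:]) if prev != cur
    )
    total = sum(samples_khz)
    return {
        "transitions": transitions,
        "sum": total,
        "mean": total / len(samples_khz),
    }


if __name__ == "__main__":
    # Synthetic trace for illustration (not measured data).
    trace = [1200000, 1200000, 1600000, 1600000, 800000]
    print(frequency_metrics(trace))
```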

5 Measurement study observations

5.1 Macro-workload resource usage

The IP-level memory bandwidth is shown in Fig.1a. CPU/GPU related bus saturation (utilisation) spikes during browser-scrolling interactivity (chrome0). For interactive workloads, graphics (gfx) and display controller (disp) require higher average memory bus bandwidth than the CPU. In Fig.1b we can see that the other IPs (e.g. MFC - hardware media codecs, ISP - hardware signal processors) can also significantly contribute to the total memory bandwidth, for applications such as video recording (camera1). Due to space constraints, only two macro-workloads are presented in Fig.1. Other macro/micro workload system performance measurements are made available online [14].

The resource usage for all macro-workloads using the default frequency governors is shown in Fig.2. ffmpeg0, game0 and camera1 have the highest average CPU, GPU and memory usage respectively. The measurements indicate that non-interactive tasks (e.g. background/idle/suspended-state tasks and vlcplayer0) have lower GPU/memory usage variation.

Table 1: Experimental workloads - macro and micro

User-centric smartphone macro-benchmarks

    idle0       Device in suspended state - No apps running, display off
    idle1       Device in idle state - No apps running, display on
    launcher0   Swipe left (7 times) on default home screens containing widgets and icons
    ffmpeg0     Software decoding of 720p video, without video output using ffmpeg (7500 frames), display off
    vlcplayer0  Video playback 720p/25fps (hardware decoding), 1 min
    line0       LINE App. Instant messaging (normal typing speed)
    line1       LINE App. Instant messaging (very slow typing followed by very fast typing speed)
    line2       LINE App. Instant messaging (open photo album, select image and send to contact)
    facebook0   Facebook Mobile app - Swipe up/down on timeline
    facebook1   Facebook Mobile app - Open photo album, swipe left/right
    camera0     Default Camera app. Tap to focus, take picture
    camera1     Default Camera app. Record video 1080p/30fps, 1 min
    music0      Background (display off) audio playback using vlcplayer (44.1KHz, 128kbps), 1 min
    ftp0        Background (display off) FTP download (20MB x 10). Rep. of downloading automatic software updates.
    chrome0     Chrome mobile browser - bbc.com/news: swipe up/down
    game0       3D game - Asphalt 8 car racing (game loading + 1 min gameplay)

Micro-benchmarks (multi-threaded) taken from [13]

    micro0      Serial and random memory read/write tests at increasing data sizes and increasing thread counts.
    micro1      Designed to read data from RAM in bursts
    micro2      The android port of Dhrystone integer benchmark (each thread executing copies of the same program)
    micro3      Floating point add, multiply arithmetic operations (2, 32 operations per input word size)

[Figure 1a plot: chrome0, MIF-800000:INT-800000. x-axis: time (s); y-axis: bus saturation %. Series: mfc0, mfc1, isp0, isp1, genfsys, gfx-mem0, gfx-mem1, cpu-mem0, cpu-mem1, disp1. Annotation: browser scrolling.]

[Figure 1b plot: camera1, MIF-800000:INT-800000. x-axis: time (s); y-axis: bus saturation %. Same series as (a).]

Figure 1: Example of bus saturation/utilisation profile for two application workloads (a) Web browsing (chrome0) (b) Video recording (camera1). disp=Display Controller, gfx=Graphics, mfc=Hardware media codecs, isp=Hardware signal processors

Table 2: Micro-workload CPU, MIF/INT frequency settings (OD: ondemand, RND: random)

    Name      CPU (GHz)  MIF (MHz)          INT (MHz)
    default   OD         def. governor      def. governor
    fixedAll  1.4        400                160
    400-160   OD         400                160
    800-800   OD         800                800
    randMIF   OD         RND(800,400,200)   160
    randINT   OD         800                RND(800,400,200)
    randMem   OD         RND(800,400,200)   RND(800,400,200)

Note that non-gaming activities such as instant messaging (line2) or web browsing (chrome0) can have high GPU utilisation as well. Similarly, non-gaming workloads (e.g. facebook1 and chrome0) can have higher peak memory costs than the 3D applications. Applications such as camera1 can show relatively higher memory usage than CPU/GPU usage, due to dedicated IPs generating memory traffic. Memory utilisation patterns for the same application can also vary depending on the use-case (e.g. facebook0 and facebook1), due to the behaviour of the default memory governor.

At very high peak utilisation levels (e.g. chrome0 and facebook1), memory bus contention and congestion occur, resulting in memory transactions being buffered/dropped; this in turn can negatively impact user experience. On certain workloads, we observed the memory bus frequency conservatively increased to the maximum, even though at that level the peak utilisation is low (e.g. vlcplayer0). However, due to the limited number of frequency levels, under-utilisation is unavoidable.

5.2 Impact of memory bus frequency on resource utilisation

Fig.3 shows the CPU and GPU cost with respect to different MIF/INT bus frequency levels. Overall, the general trend is that for interactive applications (e.g. Facebook, LINE instant messaging, Chrome web browsing), as the MIF/INT bus frequency is decreased, the CPU/GPU cost increases. This is mainly due to the CPU/GPU governors increasing their frequency level to overcome the higher utilisation levels caused by stalling (waiting for data to arrive). In our 3D game measurements (i.e. game0), the GPU cost does not change significantly as the MIF frequency is halved, even though the CPU cost increases. This indicates that certain resource management algorithms which differentiate between CPU-bound and GPU-bound phases (e.g. [7]) may be able to exploit memory frequency reduction to further reduce power consumption. Note that certain applications such as camera0 and camera1 have sub-IP specific minimum MIF/INT frequency levels required to operate.

[Figure 2 plot: two panels, y-axis 0-100. Upper panel: idle0, idle1, launcher0, ffmpeg0, vlcplayer0, music0, camera0, camera1. Lower panel: line0, line1, line2, facebook0, facebook1, chrome0, ftp0, game0. Series: cpu_util, cpu_cost, gpu_util, gpu_cost, mem_util, mem_cost.]

Figure 2: Macro-workload - CPU/GPU/memory utilisation and cost (MIF:default, INT:default)

In lightweight background applications (e.g. ftp0, idle0, idle1), a similar trend to interactive/foreground applications is not seen. In music0, we see a significant CPU cost increase only for the lowest MIF/INT frequency level. In several workloads a sharp rise in CPU and GPU cost is seen between MIF values 400 and 200 MHz, which indicates that system performance is more susceptible to MIF frequency changes than INT frequency changes. ffmpeg0 shows a high CPU cost for the default case. This is mainly because the default CPU governor stays mostly above 1GHz and incidentally the MIF frequency governor fluctuates between 800MHz and 200MHz.

5.3 Impact of memory bus frequency on power consumption and Quality-of-Service

The power consumption for each macro-workload scenario is shown in Fig.4. The power distributions correlate well with the CPU cost distributions (Fig.2). For example, launcher0 has a long-tailed power distribution because of swiping-related CPU frequency spikes. High power consumption can be due to very high CPU usage (ffmpeg0) or combined high levels of memory and GPU costs (camera0, game0). Background/non-interactive tasks have the lowest power consumption and variation due to low resource usage.

As shown in Fig.5 and Fig.6, the total system power consumption and the QoS vary when the memory bus frequency is changed. In these visualisations, the bar plots are sorted and overlaid on top of each other. For example in Fig.5, in the idle0 scenario, 800-800 has the highest normalised power level and 100-50 has the lowest.

[Figure 3 plot: two panels, cpu_cost and gpu_cost, against MIF-INT frequency setting (default, 800-800, 800-700, 800-600, 800-200, 400-400, 400-50, 200-200, 200-50, 100-50). One series per macro-workload: idle0, idle1, launcher0, ffmpeg0, vlcplayer0, music0, camera0, camera1, line0, line1, line2, facebook0, facebook1, chrome0, ftp0, game0.]

Figure 3: Macro-workload - CPU/GPU/Mem. cost with respect to MIF/INT bus frequency change

In line0, up to 40% power consumption difference between high and low memory frequencies can be seen, indicating that memory frequency scaling does impact power consumption. MIF bus frequency changes cause larger power consumption differences than INT bus frequency changes. The lowest MIF/INT frequencies can adversely affect both the power consumption (e.g. line0, line1 and line2) and QoS levels, due to the corresponding CPU/GPU utilisation increase.

The default memory governor can at times be conservative, leading to unnecessarily high memory frequencies and power consumption. For the background/non-interactive applications, up to 5% mean power reduction can be obtained with minimal increase in application response time, when using a fixed lower memory bus frequency setting over the default governor. Similarly, foreground applications such as vlcplayer0 and game0 showed 10-20% power reduction over the default governor at a QoS reduction of approx. 5%. For chrome0, the MIF/INT frequency at (400MHz, 200MHz) has comparable power consumption to the default case, but gave a 30% QoS improvement. These results indicate that the relationship between memory bus frequency, power consumption and QoS is not always linear.

Figure 4: Macro-workload power consumption distribution (MIF:default, INT:default)

[Figure 5 plot: overlaid bars of normalised power consumption (y-axis 0.0-1.0) per macro-workload (idle0 to game0), one bar per MIF-INT setting: default-default, 800-800, 800-700, 800-600, 800-200, 400-400, 400-50, 200-200, 200-50, 100-50.]

Figure 5: Macro-workload normalised mean power consumption, for all tested MIF/INT frequencies. Normalised within each scenario
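The per-scenario normalisation used for Fig.5 and Fig.6 can be sketched as follows. The text does not give the exact formula, so dividing each value by the largest value in the same scenario is an assumption, chosen because it gives the highest setting a normalised level of 1.0. The function name and the numbers in the usage example are illustrative, not measurements from the study.

```python
from typing import Dict


def normalise_within_scenario(
    measurements: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Scale each scenario's values so that its largest value becomes 1.0.

    `measurements` maps scenario name -> {MIF-INT setting -> mean value}.
    """
    normalised = {}
    for scenario, by_setting in measurements.items():
        peak = max(by_setting.values())
        normalised[scenario] = {
            setting: (value / peak if peak > 0 else 0.0)
            for setting, value in by_setting.items()
        }
    return normalised


if __name__ == "__main__":
    # Illustrative mean power values (mW), not data from the paper.
    example = {
        "idle0": {"800-800": 50.0, "100-50": 25.0},
        "game0": {"800-800": 3000.0, "400-400": 2400.0},
    }
    print(normalise_within_scenario(example))
```

Normalising within a scenario makes settings comparable inside one workload, but the resulting values should not be compared across workloads.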

[Figure 6 plot: overlaid bars of normalised QoS (y-axis 0.0-1.0) per macro-workload (launcher0 to game0), one bar per MIF-INT setting: default-default, 800-800, 800-700, 800-600, 800-200, 400-400, 400-50, 200-200, 200-50, 100-50.]

Figure 6: Macro-workload normalised mean QoS levels, for all tested MIF/INT frequencies. Normalised within each scenario

[Figure 7 illustration: panels (a) and (b).]

Figure 7: Example illustration of CPU frequency interference due to memory frequency scaling (x-axis denotes time)

5.4 Example of CPU and memory bus frequency governor inter-dependency

The example illustrations shown in Fig.7 are used to describe how memory frequency scaling can affect the decisions of the CPU governor. We assume the governors are not synchronised and are unaware of each other's decisions. Fig.7(a) first illustrates a case where the CPU frequency is increased unnecessarily, and Fig.7(b) illustrates an instance where a potential CPU frequency reduction opportunity was missed.

In Fig.7(a), the memory utilisation drops to 30 at t = 5.0 (t denotes time). At t = 5.25, the memory governor lowers the memory frequency to level 1, leading to a decrease in memory cost and a rise in CPU utilisation (due to processor stalling). At t = 5.50, the CPU governor samples the CPU utilisation and increases the CPU frequency to address the utilisation increase. In this situation, we assume the CPU load does not change significantly; had the memory frequency transition not occurred, the CPU frequency would not have changed.

The second scenario, Fig.7(b), is similar to the first, except that in this instance it is assumed that the CPU utilisation also drops at the same instant (t = 5.0); had the memory frequency governor not interfered, the CPU frequency would have dropped to level 2 (between t = 5.5 and t = 7.5). Here, the system could have saved power by dropping the CPU frequency, but the CPU governor's decision was again disrupted by the memory frequency transition.
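The first scenario can be reproduced with a deliberately simple toy model. Everything in it is an assumption made for illustration and none of it is taken from the measurements: CPU utilisation is modelled as a compute term that shrinks with CPU frequency plus a stall term that grows as the memory frequency drops, and the CPU governor is a threshold rule with hypothetical 80%/30% thresholds.

```python
from typing import List, Sequence, Tuple


def cpu_utilisation(cpu_mhz: float, mem_mhz: float,
                    compute_demand: float, stall_demand: float) -> float:
    """Toy model: busy fraction = compute time + memory stall time."""
    return min(compute_demand / cpu_mhz + stall_demand / mem_mhz, 1.0)


def simulate(mem_schedule_mhz: Sequence[float], cpu_levels_mhz: Sequence[float],
             start_idx: int, compute_demand: float, stall_demand: float,
             up: float = 0.80, down: float = 0.30) -> List[Tuple[float, float]]:
    """Return (CPU frequency, sampled utilisation) for each sampling period."""
    idx = start_idx
    trace = []
    for mem_mhz in mem_schedule_mhz:
        util = cpu_utilisation(cpu_levels_mhz[idx], mem_mhz,
                               compute_demand, stall_demand)
        trace.append((cpu_levels_mhz[idx], round(util, 3)))
        if util > up:
            idx = len(cpu_levels_mhz) - 1   # jump to the maximum level
        elif util < down and idx > 0:
            idx -= 1                        # step down one level
    return trace


if __name__ == "__main__":
    levels = [800, 1200, 1600]
    # Constant workload; only the memory frequency differs between runs.
    print(simulate([800, 800, 800, 800], levels, 1, 600, 160))
    print(simulate([800, 800, 400, 400], levels, 1, 600, 160))
```

With the memory bus fixed at 800 MHz the CPU stays at 1200 MHz throughout. When the memory frequency drops to 400 MHz, the stall term pushes the sampled utilisation over the upper threshold and the CPU jumps to 1600 MHz, although the compute demand never changed.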

5.5 Micro-workload analysis

The micro-workloads (Table 1) and treatments shown in Table 2 were used to further explore the memory bus and CPU frequency scaling dependency. As shown in Fig.8, micro2 has a large variation in CPU utilisation, and micro0 and micro1 have the highest memory bus utilisation and the largest variation in it. micro0 is mostly memory-bound, whilst the inverse is true for micro3. None of the micro-workloads have GPU load.

[Figure 8 plot: y-axis 0-100; workloads micro0, micro1, micro2, micro3. Series: cpu_util, cpu_cost, gpu_util, gpu_cost, mem_util, mem_cost.]

Figure 8: Micro-workload CPU/memory utilisation and cost (fixedAll test case)

In the micro-workload experiment, we measure the number of CPU frequency transitions and the sum of all frequencies sampled, for each of the treatments (Table 2). If the metrics differ significantly between the different test cases, this indicates that the MIF/INT frequency scaling has affected the decisions of the CPU frequency scaling governor.

Fig.9a shows the number of CPU frequency transitions of the micro-workloads. We observe that the number of CPU frequency transitions varies with the treatment, which indicates that the memory frequency transitions have indeed interfered with the decisions of the CPU frequency governor. In all cases, the default treatment differs from the other treatments. For example, in micro0, randomly changing the memory frequency (i.e. rand*) resulted in over 25-30% fewer CPU frequency transitions than the default case. Non-uniform CPU/memory utilisations can result in a lower number of CPU frequency transitions than the default treatment (e.g. randMIF). In micro1, randINT produces slightly more CPU frequency transitions than randMIF, which indicates that under certain workloads, the INT bus frequency can affect the CPU frequency governor more than the MIF bus frequency.

The summation of all CPU frequencies over the workload runtime is displayed in Fig.9b. We can see that a high number of CPU frequency transitions does not necessarily result in a higher total CPU frequency (e.g. micro0), and vice versa. In the case of randINT, a lower number of transitions can result in a higher total CPU frequency than default (in micro0 and micro3). Fixing the memory bus frequency to the highest level can often reduce the total CPU frequency (e.g. 800-800 treatment). Furthermore, in all cases except micro3, randomly scaling the MIF bus frequency (i.e. randMIF) can result in a lower total CPU frequency than the default memory frequency governor; this indicates inefficiencies in the default frequency governor.

6 Summary of key observations and system design suggestions

We now summarise the key observations made during the measurement-based study and present several resource management design suggestions drawn from the observations. It is hoped these suggestions would be beneficial for system designers to alleviate unnecessary energy dissipation.

Figure 9: Micro-workload tests. (a) Number of total CPU frequency transitions. (b) Summation of CPU frequency over test runtime. Both panels group results by micro-workload (micro0 to micro3) for the treatments default, 400-160, 800-800, randMIF, randINT and randMem; the y-axes are "# of CPU-freq transitions" (0 to 200) and "Total CPU-freq (Hz)" (0.0 to 1.5, scaled by 1e9).

• Usage-scenarios and QoS targets should be integrated into CPU frequency governor heuristics, to efficiently balance power consumption and quality of experience. Governors that focus solely on utilisation can unnecessarily increase the frequency, even for non-performance-critical, although CPU-bound, background applications (e.g. ffmpeg0).

• Lower memory bus frequencies can be exploited under background applications, as well as during idle/suspended states, to save power. Certain foreground applications such as video playback and photography can also benefit from lower memory bus frequencies with minimal impact on QoS.

• Workload prediction algorithms (e.g. [19] [11]) should also consider the memory bus traffic and resource contention generated by dedicated hardware IPs (e.g. signal processing IPs). For example, certain high memory bandwidth applications can have relatively low CPU/GPU usage (e.g. camera recording).

• Symmetrically scaling different system-on-chip internal bus frequencies (e.g. MIF, INT) should be avoided, as their load and performance/power-saving impact can differ. The ability to finely tune the different bus frequencies can also increase the bus frequency governor's flexibility.

• Certain foreground application QoS issues can be addressed by setting the memory frequency to the highest level, to handle bursty memory traffic (e.g. web browser scrolling, swiping through images). Furthermore, dynamic governor sampling periods (based on the rate of memory transactions) can balance accuracy and monitoring overhead.

• In-depth statistical analysis of the workload and the CPU/GPU/memory frequency governors can assist load prediction heuristics. For example, the impact of memory frequency scaling on CPU frequency scaling can differ based on the uniformity of the memory bus utilisation.

• Due to the lack of co-operation and synchronisation between the CPU and memory bus frequency governors, the memory bus frequency transitions can interfere with the CPU frequency governor's decisions (Section 5.4). Therefore, a cooperative/holistic frequency governing framework, which can assign an appropriate frequency setting for all the different components/resources on the SoC, is required; for example, it should ensure that decreasing the memory bus frequency does not change the CPU frequency. Such an interconnected governor should take into account the performance/QoS requirements, the state of the computing resources and also the communication subsystems (i.e. bus, memory controllers etc.) when making a frequency selection. A unified governor can have more control and flexibility over the performance of the different on-chip resources to avoid unnecessary power consumption.
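The last suggestion above can be illustrated with a deliberately simplified sketch of a unified selection step. It is not the framework proposed as future work: the frequency tables, the `Snapshot` fields, the headroom factor and all function names are hypothetical. The one idea it shows is that both decisions are taken together from demand expressed as utilisation multiplied by the current frequency, so the CPU decision is based on the CPU's own work rather than on the frequency the memory bus happens to run at. It does not model how memory stalls change CPU utilisation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical frequency tables in MHz, sorted ascending.
# These are NOT the actual Exynos 5410 operating points.
CPU_FREQS_MHZ = (250, 600, 1200, 1600)
MIF_FREQS_MHZ = (100, 200, 400, 800)


@dataclass(frozen=True)
class Snapshot:
    """One sampling-period view of the SoC (all fields are assumptions)."""
    cpu_util: float             # CPU utilisation in [0, 1] at cpu_freq_mhz
    cpu_freq_mhz: int           # CPU frequency during the sampling period
    mem_util: float             # memory bus utilisation in [0, 1] at mif_freq_mhz
    mif_freq_mhz: int           # memory bus frequency during the sampling period
    performance_critical: bool  # usage-scenario / QoS hint


def lowest_sufficient(levels: Sequence[int], demand_mhz: float) -> int:
    """Pick the lowest level that covers the demand, else the highest level."""
    for level in levels:
        if level >= demand_mhz:
            return level
    return levels[-1]


def select_frequencies(s: Snapshot) -> Tuple[int, int]:
    """Select (CPU, memory bus) frequencies in a single decision step."""
    # Extra headroom only for performance-critical usage scenarios.
    headroom = 1.25 if s.performance_critical else 1.0
    # Demand = utilisation x current frequency, i.e. work done per unit time.
    cpu_demand = s.cpu_util * s.cpu_freq_mhz * headroom
    mem_demand = s.mem_util * s.mif_freq_mhz * headroom
    return (
        lowest_sufficient(CPU_FREQS_MHZ, cpu_demand),
        lowest_sufficient(MIF_FREQS_MHZ, mem_demand),
    )


# Illustrative snapshots: a background task and a performance-critical one.
background = Snapshot(0.5, 1200, 0.25, 800, performance_critical=False)
foreground = Snapshot(0.5, 1200, 0.25, 800, performance_critical=True)
```

With identical load, the background snapshot is mapped to lower CPU and memory bus frequencies than the performance-critical one, reflecting the first suggestion that usage scenarios and QoS targets should inform the governor.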

7 Conclusion and future work

This work presented on-chip resource usage and power consumption measurements of popular user-centric smartphone workloads. In particular, we analysed the impact of memory frequency scaling on power consumption and its interplay with the CPU and GPU. We presented the CPU, GPU and memory cost of user-centric macro-workloads under default system settings. Our measurements indicated that under certain application types (e.g. idle states, background applications, hardware-accelerated video playback, 3D gaming), fixing the memory bus frequency at lower frequencies can provide higher power savings (5-20%) with marginal QoS degradation, compared to the default memory frequency governor. We illustrated two cases where the memory bus frequency governor interferes with the CPU frequency governor, to either unnecessarily increase power consumption or fail to save power. Micro-workloads were used to further demonstrate the CPU frequency governor inconsistencies caused by the memory frequency interference. Lastly, we presented several system design suggestions related to CPU/GPU/memory bus resource management drawn from our observations. As future work, we are working towards implementing an interconnected CPU-GPU-memory bus frequency governing framework which can help further reduce the power consumption of smartphones.

Acknowledgement

This work was funded by the HiPEAC 2016 collaboration grant (The FP7 HiPEAC Network of Excellence).

8 References

[1] ARM. big.LITTLE technology. https://developer.arm.com/technologies/big-little, 2013.

[2] R. Begum, M. Hempstead, G. P. Srinivasa, and G. Challen. Algorithms for CPU and DRAM DVFS under inefficiency constraints. In Int. Conf. on Comp. Design (ICCD), pages 161–168, 2016.

[3] R. Begum, D. Werner, M. Hempstead, G. Prasad, and G. Challen. Energy-performance trade-offs on energy-constrained devices with multi-component DVFS. In Int. Symp. on Workload Characterization (IISWC), pages 34–43, 2015.

[4] A. Carroll and G. Heiser. The systems hacker's guide to the galaxy: energy usage in a modern smartphone. In Asia-Pacific Workshop on Systems (APSYS), pages 5–12, 2013.

[5] A. Carroll, G. Heiser, et al. An analysis of power consumption in a smartphone. In USENIX, pages 1–14, 2010.

[6] N. Chaudhary, T. Pallavi, et al. Bus bandwidth monitoring, prediction and control. In Conf. on Advances in Comp., Comm. and Informatics (ICACCI), pages 1152–1158, 2015.

[7] W.-M. Chen, S.-W. Cheng, P.-C. Hsiu, and T.-W. Kuo. A user-centric CPU-GPU governing framework for 3D games on mobile devices. In IEEE/ACM Conf. on Computer-Aided Design (ICCAD), pages 224–231, 2015.

[8] C. Gao, A. Gutierrez, M. Rajan, R. G. Dreslinski, T. Mudge, and C. J. Wu. A study of mobile device utilization. In Symp. on Perf. Analysis of Sys. and Software (ISPASS), pages 225–234, 2015.

[9] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver. Full-system analysis and characterization of interactive smartphone applications. In Int. Symp. on Workload Characterization (IISWC), pages 81–90, 2011.

[10] M. Ham. Introduce devfreq: generic DVFS framework with device-specific opps. https://lwn.net/Articles/445044/, May 2011.

[11] C. Y. Hsieh, J. G. Park, N. Dutt, and S. S. Lim. Memory-aware cooperative CPU-GPU DVFS governor for mobile games. In IEEE Symp. on Embedded Sys. For Real-time Multimedia (ESTIMedia), pages 1–8, 2015.

[12] E. Le Sueur and G. Heiser. Dynamic voltage and frequency scaling: The laws of diminishing returns. In Int. Conf. on Power Aware Comp. and Sys., pages 1–5, 2010.

[13] R. Longbottom. Android benchmarks by Roy Longbottom. http://www.roylongbottom.org.uk/android%20benchmarks.htm, 2017.

[14] H. Mendis. System performance measurement data. https://goo.gl/mQBrGS.

[15] Monsoon. Power monitor. https://www.msoon.com/LabEquipment/PowerMonitor/, 2017.

[16] N. C. Nachiappan, P. Yedlapalli, N. Soundararajan, A. Sivasubramaniam, M. T. Kandemir, R. Iyer, and C. R. Das. Domain knowledge based energy management in handhelds. In Int. Symp. on High Perf. Comp. Arch. (HPCA), pages 150–160, 2015.

[17] V. Pallipadi and A. Starikovskiy. The ondemand governor. In Linux Symposium, pages 215–230, 2006.

[18] D. Pandiyan and C. J. Wu. Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms. In Int. Symp. on Workload Characterization (IISWC), pages 171–180, 2014.

[19] A. Pathania, Q. Jiao, A. Prakash, and T. Mitra. Integrated CPU-GPU power management for 3D mobile games. In Design Automation Conf. (DAC), pages 1–6, 2014.

[20] S. Patil, Y. Kim, K. Korgaonkar, I. Awwal, and T. S. Rosing. Characterization of user's behavior variations for design of replayable mobile workloads. In Mobile Comp., Apps., and Services Conf., pages 51–70, 2015.
