LCE12: big.LITTLE TC2 update

Jun 13, 2015

Linaro

Resource: LCE12
Name: big.LITTLE TC2 update
Date: 29-10-2012
Speaker: Morten Rasmussen
Transcript
Page 1: LCE12: big.LITTLE TC2 update

Update on big.LITTLE on TC2

Morten Rasmussen, Technology Researcher

Page 2: LCE12: big.LITTLE TC2 update

Agenda

big.LITTLE Software solutions overview

ARM's Test Chip 2 overview

Benchmarking Methodology and Use Cases

IKS status update

big.LITTLE MP status update

Page 3: LCE12: big.LITTLE TC2 update

big.LITTLE overview

Performance and power efficiency in one system:

Benchmark     Cortex-A15 vs Cortex-A7 performance   Cortex-A7 vs Cortex-A15 energy efficiency
Dhrystone     1.9x                                  3.5x
FDCT          2.3x                                  3.8x
IMDCT         3.0x                                  3.0x
MemCopy L1    1.9x                                  2.3x
MemCopy L2    1.9x                                  3.4x

Page 4: LCE12: big.LITTLE TC2 update

IKS solution – Basics

In-Kernel Switcher (IKS): Targeted first generation big.LITTLE products.

[Diagram: Task 1, Task 2; Kernel (scheduler, IKS); Logical CPU mapped to either Cortex-A7 or Cortex-A15]

Page 5: LCE12: big.LITTLE TC2 update

MP solution

[Diagram: Task 1, Task 2; Kernel (scheduler); scheduler chooses between Cortex-A7 and Cortex-A15]

Page 6: LCE12: big.LITTLE TC2 update

ARM’s Test Chip 2 (TC#2): An Overview

A publicly available Versatile Express core tile.

Capabilities:

2 x A15 (r2p1) @ up to 1.2 GHz

3 x A7 (r0p1) @ up to 1 GHz

CCI/DMC/GIC/ADB (r0p0)

DMA (PL330)

2 GB external DDR2 memory @ 400 MHz

64 KB internal SRAM

CoreSight debug (including JTAG and ITM trace, but no STM)

No GPU

cpufreq support: independent for each cluster, with limited voltage scaling

cpuidle support: cluster power gating

Page 7: LCE12: big.LITTLE TC2 update

Benchmarking Methodology

Automated system for running user workloads on the target device:

Choose workload

Choose CPU mode: Cortex-A7, Cortex-A15, Migration (cluster or CPU), or MP

Choose active cores in each cluster (TC2: 1-2 big, 1-3 LITTLE)

Choose DVFS governor: interactive, performance, powersave, ondemand

Extensible: parameterisation

CSV config:
- Use case
- Scheduling model
- Number of cores to use
- Scaling governors

Results:
- Performance
- Power

Configurable:
- CCI
- ftrace
- streamline
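The CSV-driven setup on this slide can be pictured as a small loader that validates each run before it is executed. This is only a sketch: the column names (`use_case`, `mode`, `big`, `little`, `governor`) and the mode spellings are hypothetical and are not the format of the actual tool. Only the governor names and the TC2 core ranges are taken from the slide.

```python
import csv
import io

# Governor names and TC2 core ranges come from the slide; the rest is assumed.
GOVERNORS = {"interactive", "performance", "powersave", "ondemand"}
MODES = {"a7", "a15", "cluster-migration", "cpu-migration", "mp"}  # hypothetical spellings
BIG_RANGE = range(1, 3)     # TC2: 1-2 big cores
LITTLE_RANGE = range(1, 4)  # TC2: 1-3 LITTLE cores


def load_runs(csv_text):
    """Parse and validate benchmark runs from CSV text (hypothetical columns)."""
    runs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        run = {
            "use_case": row["use_case"],
            "mode": row["mode"],
            "big": int(row["big"]),
            "little": int(row["little"]),
            "governor": row["governor"],
        }
        if run["mode"] not in MODES:
            raise ValueError(f"unknown CPU mode: {run['mode']}")
        if run["governor"] not in GOVERNORS:
            raise ValueError(f"unknown governor: {run['governor']}")
        if run["big"] not in BIG_RANGE or run["little"] not in LITTLE_RANGE:
            raise ValueError("active core count outside the TC2 range")
        runs.append(run)
    return runs


# Illustrative input only; these rows are not measurements from the talk.
EXAMPLE = """use_case,mode,big,little,governor
audio,mp,2,3,interactive
bbench+audio,cpu-migration,2,2,ondemand
"""
```

Calling `load_runs(EXAMPLE)` returns two validated run descriptions; a row naming an unlisted governor or an out-of-range core count is rejected before anything runs on the target.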

Page 8: LCE12: big.LITTLE TC2 update

IKS solution

Targeted first generation big.LITTLE products.

[Diagram: Task 1, Task 2; Kernel (scheduler, IKS); Logical CPU mapped to either Cortex-A7 or Cortex-A15]

Page 9: LCE12: big.LITTLE TC2 update

IKS: CPU Migration

big.LITTLE extends DVFS

DVFS algorithm monitors load on each CPU

When load is low it can be handled on a LITTLE processor

When load is high the context is transferred to a big processor

The unused processor can be powered down

When all processors in a cluster are inactive the cluster and its L2 cache can be powered down
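The migration rule on this slide can be modelled in a few lines. This is a toy sketch, not the IKS implementation: the 80% switch point is a hypothetical value, and the model assumes every logical CPU always occupies exactly one physical core, so a cluster counts as inactive only when no logical CPU is placed on it.

```python
SWITCH_UP_LOAD = 80  # hypothetical threshold in percent, not a value from the talk


def choose_core(load_percent, switch_up_load=SWITCH_UP_LOAD):
    """Low load stays on a LITTLE core; high load moves to a big core."""
    return "big" if load_percent >= switch_up_load else "LITTLE"


def plan(loads):
    """Return the per-logical-CPU placement and which clusters must stay powered."""
    placement = [choose_core(load) for load in loads]
    # A cluster (and its L2 cache) can be powered down when none of its cores is in use.
    clusters_on = {name: name in placement for name in ("big", "LITTLE")}
    return placement, clusters_on
```

With two lightly loaded logical CPUs, `plan([10, 20])` places both on LITTLE cores and reports the big cluster as eligible for power-down.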

Page 11: LCE12: big.LITTLE TC2 update

IKS: OPP mapping to A7 / A15 on TC2

Virtual frequency maps OPPs to big or LITTLE cores.

Cluster   Virtual OPP   Physical OPP A7   Physical OPP A15   Voltage
A7        350000        350000            -                  V1
A7        400000        400000            -                  V1
A7        X (...)       X                 -                  V1
A7        800000        800000            -                  V1
A7        900000        900000            -                  V2
A7        1000000       1000000           -                  V3
A15       1200000       -                 600000             V1
A15       1400000       -                 700000             V1
A15       2X (...)      -                 X                  V1
A15       2000000       -                 1000000            V1
A15       2200000       -                 1100000            V2
A15       2400000       -                 1200000            V3
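The mapping in the table above can be written as a lookup from a virtual OPP to a cluster and a physical frequency. The sketch below uses only the endpoints shown on the slide (A7: virtual equals physical; A15: virtual is twice the physical frequency). It assumes the values are in kHz, as is usual for cpufreq, and it rejects virtual frequencies in the gap between the two ranges, which is a choice made here rather than documented behaviour.

```python
A7_VIRTUAL_RANGE = (350_000, 1_000_000)      # endpoints listed for the A7
A15_VIRTUAL_RANGE = (1_200_000, 2_400_000)   # endpoints listed for the A15


def map_virtual_opp(virtual_khz):
    """Map a virtual frequency to (cluster, physical frequency)."""
    if A7_VIRTUAL_RANGE[0] <= virtual_khz <= A7_VIRTUAL_RANGE[1]:
        return ("A7", virtual_khz)            # identity mapping on the LITTLE core
    if A15_VIRTUAL_RANGE[0] <= virtual_khz <= A15_VIRTUAL_RANGE[1]:
        return ("A15", virtual_khz // 2)      # big core runs at half the virtual value
    raise ValueError(f"no OPP for virtual frequency {virtual_khz}")
```

For example, `map_virtual_opp(2_200_000)` selects the A15 at 1100000, matching the corresponding row of the table.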

Page 12: LCE12: big.LITTLE TC2 update

IKS: Results for Audio on TC2

Power compared to executing the use case on A15

IKS does not use A15s during Audio run

70% saving

TC2: A15 up to 1.2 GHz, A7 up to 1 GHz. Better results expected on representative silicon.

Page 13: LCE12: big.LITTLE TC2 update

IKS: Results for BBench + Audio on TC2

Performance is measured from BBench page loading times.

Results are normalised to the power and performance of the same use case run on A15 only.

BBench page + Audio

Page 14: LCE12: big.LITTLE TC2 update

IKS: OPPs on TC2

Page 15: LCE12: big.LITTLE TC2 update

IKS: Interactive governor on TC2

    if (cpu_load >= go_hispeed_load) {
        ...
        new_freq = max_freq * cpu_load / 100;
        ...
    } else {
        ...
        new_freq = hispeed_freq * cpu_load / 100;
        ...
    }

For the A15 on TC2, with go_hispeed_load at 85% (the default), this algorithm only uses the overdrive section of the A15.

The approach is to introduce a second point of inflection: hispeed2
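A small model of the rule shown on this slide makes the jump at the threshold easy to see. It reproduces only the two branches quoted above, not the full interactive governor. The function name is illustrative, and the `hispeed_freq` value used in the example is hypothetical; the `max_freq` of 2400000 is the top virtual OPP from the IKS mapping slide.

```python
def interactive_target(cpu_load, max_freq, hispeed_freq, go_hispeed_load=85):
    """Target frequency for a CPU load in percent, per the two branches on the slide."""
    if cpu_load >= go_hispeed_load:
        return max_freq * cpu_load // 100
    return hispeed_freq * cpu_load // 100


MAX_VIRTUAL = 2_400_000       # top virtual OPP on TC2 (from the OPP mapping slide)
EXAMPLE_HISPEED = 1_000_000   # hypothetical hispeed_freq, chosen for illustration

# The lowest frequency the upper branch can ever request is 85% of max_freq.
lowest_high_branch = interactive_target(85, MAX_VIRTUAL, EXAMPLE_HISPEED)
```

Under these assumptions the upper branch never requests less than 2040000, so any load at or above the threshold lands among the highest virtual OPPs, while one percentage point lower the request drops to a fraction of `hispeed_freq`. A second inflection point gives the governor a step between those two regions.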

Page 16: LCE12: big.LITTLE TC2 update

IKS: Hispeed2

Page 17: LCE12: big.LITTLE TC2 update

IKS: Results: BBench + Audio

Power improves with no performance cost

BBench page + Audio

Page 18: LCE12: big.LITTLE TC2 update

MP solution

[Diagram: Task 1, Task 2; Kernel (scheduler); scheduler chooses between Cortex-A7 and Cortex-A15]

Page 19: LCE12: big.LITTLE TC2 update

MP solution – more details

Scheduler modifications:

Treat big and LITTLE cpus as separate scheduling domains.

Use PJT's load-tracking patches to track individual task load.

Migrate tasks between the big and the LITTLE domains based on task load.

Patch set available through Linaro.

[Diagram: LITTLE (L) and big (B) cpus; load balance within each domain; load-based task migration between the domains]

[Plot: task load against task state (executing, sleep), showing load decay]

Page 20: LCE12: big.LITTLE TC2 update

MP: Experimental Implementation

Scheduler modifications:

Apply PJT's load-tracking patch set.

Set up big and LITTLE sched_domains with no load-balancing between them.

select_task_rq_fair() checks task load history to select an appropriate target CPU for tasks waking up.

Add a forced migration mechanism to push the currently running task to a big core, similar to the existing active load balancing mechanism.

Periodically check (in run_rebalance_domains()) the current task on each LITTLE runqueue to see whether it needs to be force-migrated to a big core.

[Diagram: LITTLE (L) and big (B) cpus; load_balance within each domain; select_task_rq_fair()/forced migration between the domains]
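The two decision points described on this slide, wake-up placement and the periodic forced-migration check, can be illustrated with a toy model. This is not the patch set's code: the thresholds, the hysteresis band between them, and the data layout are all assumptions made for illustration, and the function names are hypothetical stand-ins for logic that lives inside the kernel functions named above.

```python
UP_THRESHOLD = 0.8    # hypothetical: tracked load above this prefers a big cpu
DOWN_THRESHOLD = 0.3  # hypothetical: tracked load below this prefers a LITTLE cpu


def select_domain_on_wakeup(task_load, prev_domain):
    """Pick the target domain for a waking task from its tracked load history."""
    if task_load > UP_THRESHOLD:
        return "big"
    if task_load < DOWN_THRESHOLD:
        return "LITTLE"
    return prev_domain  # in between: stay put, to avoid ping-ponging


def tasks_to_force_up(little_current):
    """Periodic check: which currently running LITTLE tasks should be pushed to big?

    little_current maps a LITTLE cpu id to (task name, tracked load) for the
    task currently running on that cpu's runqueue.
    """
    return [name for name, load in little_current.values() if load > UP_THRESHOLD]
```

The forced check matters for tasks that never sleep: they never pass through the wake-up path, so without it a task that becomes heavy while running would stay on a LITTLE cpu.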

Page 21: LCE12: big.LITTLE TC2 update

MP: ARM TC2: Audio

Workload: Audio (mp3 playback)

Performance/Energy target: A7 energy

Status: Audio-related tasks do not use the A15s, but the power consumption is still significantly higher than on A7 alone.

MP is not as power efficient as IKS yet.

Todo: Target spurious wake-ups on the A15s. All the extra power comes from the A15s, which should not be used at all.

Energy:
A7   30.79%
MP   39.86%

[Bar chart: Audio energy on a 0 to 100 scale; bars labelled A15, A7 2CPU, IKS, MP]

Page 22: LCE12: big.LITTLE TC2 update

MP: Audio workload analysis

Where is the extra energy spent with MP? Need to look at why the A15s consume power when they are not needed.

[Bar chart: "Audio energy breakdown" for A7 and MP, split into A15 cluster and A7 cluster; energy axis from 0 to 1.6]

hrtimer functions    cpu0   cpu1   cpu2   cpu3   cpu4
hrtimer_wakeup          2      2   1212    417    190
tick_sched_timer      404     58    483    507    779

WQ functions         cpu0   cpu1   cpu2   cpu3   cpu4
vmstat_update          30      2     27     25     28
cache_reap             15      2     14     13     14
phy_state_machine      31      0      0      0      0

Enter idle (state)   cpu0   cpu1   cpu2   cpu3   cpu4
0                       6      2   2379    260    423
1                     801    807   8316   9373   9652

Page 23: LCE12: big.LITTLE TC2 update

Scale invariant load

Load accumulation rate does not scale with available compute capacity (frequency, big/LITTLE cpu).

Currently, there is no link between cpufreq and the scheduler. Tasks may be migrated away from a cpu at low frequency by the scheduler before cpufreq has increased the frequency to match the cpu load.

Scaling the tracked load accumulation to match the current frequency mitigates this issue. Tasks cannot accumulate enough load at low frequency to trigger migration and must wait for cpufreq to react first.

[Figure labels: Freq = x, Freq = 2x]
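The effect of scaling load accumulation by the current frequency can be shown with a simplified tracker. The sketch below is an assumption-laden model, not the scheduler patch: it uses a normalised exponentially decaying average (so a fully busy task at full frequency tends towards 1.0), `freq_scale` stands for current frequency divided by maximum frequency, and the 0.8 migration threshold is hypothetical.

```python
MIGRATE_UP = 0.8  # hypothetical up-migration threshold


def accumulate_load(busy, y=0.9785, freq_scale=1.0):
    """Track load over periods; busy holds per-period busy fractions in [0, 1].

    Each period's contribution is scaled by freq_scale (current / max frequency),
    so a task running flat out at low frequency accumulates less load.
    """
    load = 0.0
    for b in busy:
        load = load * y + b * freq_scale * (1.0 - y)
    return load


always_busy = [1.0] * 1000
load_at_half_freq = accumulate_load(always_busy, freq_scale=0.5)
load_at_full_freq = accumulate_load(always_busy, freq_scale=1.0)
```

In this model a task that is always running at half frequency saturates near 0.5 and stays below the threshold, so it cannot trigger migration until cpufreq has raised the frequency. At full frequency the same task crosses the threshold.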

Page 24: LCE12: big.LITTLE TC2 update

Scale invariant load

[Figure: two plots over time, titled "Original" and "Frequency invariant"; vertical axes from 0 to 1000]

Page 25: LCE12: big.LITTLE TC2 update

Load accumulation rate

For some workloads tracked load saturates too fast and leads to unnecessary task migrations.

Extending the tracked load history reduces tracked load variations due to sudden changes in the load characteristics.

Increasing the y factor in the load expression decreases the load accumulation and decay rates.

$$\text{load} = \frac{u_0 + u_1 \cdot y + u_2 \cdot y^2 + \dots + u_n \cdot y^n}{1024 + y + y^2 + \dots + y^n + 1}, \qquad y < 1,\ 0 \le u < 1024$$

[Plot: curve for y = 0.9785 against Time [ms]; vertical axis from 0 to 1]
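A short numerical sketch helps show how y controls the length of the tracked history. It follows the structure of the load expression on this slide, with one stated assumption: each period in the denominator is weighted by 1024, the bound on u, so that the result stays between 0 and 1 like the plot axis. That weighting, the function names, and the `half_life_periods` helper are illustrative and are not taken from the patches.

```python
import math

PERIOD_MAX = 1024  # bound on each u_i, from the constraint 0 <= u < 1024


def tracked_load(u, y):
    """Decayed load; u[0] is the most recent period, u[-1] the oldest."""
    if not 0 < y < 1:
        raise ValueError("y must lie strictly between 0 and 1")
    numerator = sum(u_i * y**i for i, u_i in enumerate(u))
    # Assumed weighting: every period counts as PERIOD_MAX in the denominator.
    denominator = PERIOD_MAX * sum(y**i for i in range(len(u))) + 1
    return numerator / denominator


def half_life_periods(y):
    """Number of periods after which a contribution has decayed to half."""
    return math.log(0.5) / math.log(y)
```

With y = 0.9785 a contribution halves after roughly 32 periods. A larger y gives a longer half-life, which is what slows both accumulation and decay.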

Page 26: LCE12: big.LITTLE TC2 update

Load accumulation rate

Increasing y leads to a more conservative tracked load.

Should lead to fewer up/down migrations.

Increases the up/down migration delay for tasks that need to be migrated.

[Plot: "Load accumulation rate"; tracked load against Time [ms] for a task, with curves for y = 0.9785, y = 0.9844 and y = 0.9922]
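The trade-off between fewer migrations and longer migration delay can be estimated by counting how many periods a continuously running task needs to reach a given tracked load. The sketch below assumes a normalised decaying average that starts from zero (a task that has been idle for a long time) and a hypothetical threshold of 0.8. If one period is about 1 ms, the counts read directly as milliseconds.

```python
def periods_to_reach(threshold, y):
    """Periods a fully busy task needs for its tracked load to reach threshold."""
    if not 0 <= threshold < 1:
        raise ValueError("threshold must be in [0, 1)")
    if not 0 < y < 1:
        raise ValueError("y must lie strictly between 0 and 1")
    load, periods = 0.0, 0
    while load < threshold:
        load = load * y + (1.0 - y)  # fully busy period, normalised contribution
        periods += 1
    return periods


THRESHOLD = 0.8  # hypothetical migration threshold
delays = {y: periods_to_reach(THRESHOLD, y) for y in (0.9785, 0.9844, 0.9922)}
```

The delay grows with y for the three values shown on the slide, which is the cost paid for the smoother, more conservative load signal.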

Page 27: LCE12: big.LITTLE TC2 update

MP – Top Issues

Spurious wakeups
- A15s are woken up by scheduler ticks (mainly)
- Workqueues
- Timers
- RCU

cpu wakeup prioritisation
- Pick the cheapest target cpu

Global balancing
- Spread load to A7s when A15s are overloaded

Pack vs. spread

Cluster aware cpufreq governors

Page 28: LCE12: big.LITTLE TC2 update

Questions?