Cache Optimization for Mobile Devices Running Multimedia Applications

Komal Kasat, Gaurav Chitroda, Nalini Kumar

Feb 26, 2016

Transcript
Page 1: Cache Optimization for Mobile Devices Running Multimedia Applications

Cache Optimization for Mobile Devices Running Multimedia Applications

Komal Kasat
Gaurav Chitroda
Nalini Kumar

Page 2: Cache Optimization for Mobile Devices Running Multimedia Applications

Outline
Introduction
MPEG-4
Architecture
Simulation
Results
Conclusion

Page 3: Cache Optimization for Mobile Devices Running Multimedia Applications

INTRODUCTION

Page 4: Cache Optimization for Mobile Devices Running Multimedia Applications

Multimedia
Combination of graphics, video, and audio
Operates on data presented visually and aurally
In multimedia operations, compression is done such that data less significant to the viewer is discarded
Common events are represented by fewer bits, while rare events use more bits (see the entropy-coding sketch after this slide)
Transmitter encodes and transmits; decoder decodes and plays the data back

Introduction
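As a concrete illustration of "fewer bits for common events" (not taken from the slides), here is a minimal Huffman-style sketch in Python — one common form of entropy coding, not necessarily the exact scheme meant above. It only computes code lengths from symbol frequencies; common symbols end up with shorter codes.

```python
# Minimal sketch: entropy coding assigns fewer bits to common symbols
# and more bits to rare ones.
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return a {symbol: code_length} map built from symbol frequencies."""
    freq = Counter(data)
    # Each heap entry: (weight, tie_breaker, {symbol: length_so_far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaabbbccd")   # 'a' is common, 'd' is rare
print(lengths)   # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```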

Page 5: Cache Optimization for Mobile Devices Running Multimedia Applications

Caches
Size and complexity of multimedia applications is increasing
Critical applications have time constraints
Requires more computational power & more traffic from CPU to memory
Significant processor/memory speed gap
To deal with memory bottlenecks we use caches
Cache improves performance by reducing data access time

Introduction

Page 6: Cache Optimization for Mobile Devices Running Multimedia Applications

[Diagram: Memory hierarchy without cache — CPU connected to Main Memory over the bus]

Memory Hierarchy

Introduction

Page 7: Cache Optimization for Mobile Devices Running Multimedia Applications

[Diagram: Memory hierarchy with a single cache — CPU, Cache, and Main Memory connected over the bus]

Memory Hierarchy

Introduction

Page 8: Cache Optimization for Mobile Devices Running Multimedia Applications

[Diagram: Memory hierarchy with two cache levels — CPU, CL1, CL2, and Main Memory connected over the bus]

Memory Hierarchy

Introduction

Page 9: Cache Optimization for Mobile Devices Running Multimedia Applications

Data between the CPU and cache is transferred as data objects

Data between the cache and main memory is transferred as blocks (see the sketch after this slide)

[Diagram: Data transfer among CPU, Cache and Main Memory — data-object transfer between CPU and cache, block transfer between cache and main memory]

Introduction
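A minimal, illustrative sketch (not the paper's simulator) of the transfer pattern described on this slide: the CPU requests individual data objects, and on a miss the cache pulls a whole block from main memory. The class and parameter names are hypothetical.

```python
# A tiny direct-mapped cache: CPU word accesses are served from the cache,
# and a miss triggers a whole-block transfer from main memory.
class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines       # one tag per cache line

    def access(self, address):
        block = address // self.block_size   # which memory block holds the data
        index = block % self.num_lines       # which cache line it maps to
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return "hit"                     # data object served from the cache
        self.tags[index] = tag               # block transfer from main memory
        return "miss"

cache = DirectMappedCache(num_lines=4, block_size=16)
print([cache.access(a) for a in (0, 4, 80, 0)])  # ['miss', 'hit', 'miss', 'hit']
```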

Page 10: Cache Optimization for Mobile Devices Running Multimedia Applications

Why Cache Optimization?
With improved CPUs, the memory subsystem deficiency is the main performance bottleneck
Sufficient reuse of values for caching reduces the raw memory bandwidth required for video data
High data rates, large sizes and distinctive memory access patterns of MPEG exert strain on caches
Though miss rates are acceptable, they increase cache-memory traffic
Dropped frames or blocking make caches inefficient
We have limited power and bandwidth in mobile embedded applications
Cache inefficiency has an impact on system cost

Introduction

Page 11: Cache Optimization for Mobile Devices Running Multimedia Applications

MPEG-4

Page 12: Cache Optimization for Mobile Devices Running Multimedia Applications

MPEG-4
Moving Picture Experts Group
Next-generation global multimedia standard
Defines the compression of Audio and Visual (AV) digital data
Exploits both spatial & temporal redundancy for compression
What is the technique?

MPEG-4

Page 13: Cache Optimization for Mobile Devices Running Multimedia Applications

Break data into 8 x 8 pixel blocks
Apply the Discrete Cosine Transform (see the sketch after this slide)
Quantize, then apply RLE and entropy coding
For temporal redundancy – motion compensation
3 types of frames:
◦ I (intra): contains the complete image, compressed for spatial redundancy only
◦ P (predicted): built from 16 x 16 macroblocks
  Macroblock: consists of pixels from the closest previous I or P frame, chosen so that fewer bits are required
◦ B (bidirectional): information not in the reference frames is encoded block by block; the two reference frames are an I and a P frame, one before and one after in temporal order

MPEG-4
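A small sketch of the block-transform step just described, under the usual assumption of an orthonormal 8 x 8 DCT-II and a purely illustrative uniform quantizer (the codec's actual quantization and zig-zag run-length coding are omitted here).

```python
import numpy as np

def dct2(block):
    """Naive 2-D DCT-II of an 8x8 block (orthonormal scaling)."""
    N = block.shape[0]
    n = np.arange(N)
    # 1-D DCT-II basis matrix C, so that dct2(X) = C @ X @ C.T
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C *= np.sqrt(2.0 / N)
    C[0, :] *= np.sqrt(0.5)
    return C @ block @ C.T

block = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for 8x8 pixel values
coeffs = dct2(block - 128)                          # level shift (illustrative), then DCT
quantized = np.round(coeffs / 16)                   # coarse uniform quantizer (illustrative)
```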

Page 14: Cache Optimization for Mobile Devices Running Multimedia Applications

Consider a GOP with 7 picture frames
Due to dependencies, frames are processed in non-temporal order
The encoding, transmission and decoding order should be the same
2 parameters, M & N, are specified at the encoder (see the sketch after this slide):
◦ An I frame is decoded every N frames
◦ A P frame is decoded every M frames
◦ The rest are B frames

Consider the simplified bit stream hierarchical structure

MPEG-4
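A sketch of one possible reading of the M/N convention above (assumed, since the slides only give the N=7, M=3 example): an I frame starts the GOP and a P frame appears every M positions, with B frames in between.

```python
# Derive display-order frame types of one GOP from the parameters N and M.
def gop_frame_types(N, M):
    types = []
    for pos in range(1, N + 1):
        if pos == 1:
            types.append(f"I{pos}")          # GOP starts with an I frame
        elif (pos - 1) % M == 0:
            types.append(f"P{pos}")          # a P frame every M positions
        else:
            types.append(f"B{pos}")          # everything else is a B frame
    return types

print(gop_frame_types(7, 3))   # ['I1', 'B2', 'B3', 'P4', 'B5', 'B6', 'P7']
```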

Page 15: Cache Optimization for Mobile Devices Running Multimedia Applications

[Diagram: GOP example with N=7 & M=3 — frames I1, B2, B3, P4, B5, B6, P7, with prediction and bidirectional-prediction arrows]

MPEG-4

Page 16: Cache Optimization for Mobile Devices Running Multimedia Applications

Sequence Header GOP …. GOP

GOP Header Picture …. Picture

Picture Header Slice …. Slice

Slice Header Macro-block …. Macro-block

Macro-block Header Block …. Block

MPEG-4
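The layered structure above can be pictured as nested containers; this schematic (field names are illustrative only) mirrors the Sequence → GOP → Picture → Slice → Macro-block → Block nesting.

```python
# Purely schematic: each level carries a header plus a list of the next level.
bitstream = {
    "sequence_header": ...,
    "gops": [{
        "gop_header": ...,
        "pictures": [{
            "picture_header": ...,
            "slices": [{
                "slice_header": ...,
                "macroblocks": [{
                    "macroblock_header": ...,
                    "blocks": [...],
                }],
            }],
        }],
    }],
}
```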

Page 17: Cache Optimization for Mobile Devices Running Multimedia Applications

The decoder reads data as a stream of bits
Each section is identified by a unique bit pattern
A GOP contains at least one I-frame and its dependent P and B frames
There are dependencies while decoding the encoded video
So selecting the right cache parameters improves cache performance significantly
Hence cache optimization is important

MPEG-4

Page 18: Cache Optimization for Mobile Devices Running Multimedia Applications

ARCHITECTURE

Page 19: Cache Optimization for Mobile Devices Running Multimedia Applications

Cache Design Parameters

Cache Size:
Most significant design parameter
Usually increased by factors of two
Increasing cache size shows improvement
Cost & space constraints make it a critical design decision

Line Size:
Larger line size – lower miss rates, superior spatial locality
Sub-block placement helps decouple the size of cache lines & the memory bus
More data has to be read and written back on a miss
Memory traffic is minimal with small lines

(A cache-geometry sketch follows this slide.)

Architecture
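A minimal sketch of the standard relationships between these parameters (textbook formulas, not something given on the slides): cache size, line size and associativity fix the number of sets and the address split the cache uses.

```python
# Assumes power-of-two sizes, as is typical for the configurations simulated here.
def cache_geometry(size_bytes, line_bytes, ways):
    lines = size_bytes // line_bytes               # total cache lines
    sets = lines // ways                           # sets = lines / associativity
    offset_bits = line_bytes.bit_length() - 1      # log2(line size)
    index_bits = sets.bit_length() - 1             # log2(number of sets)
    return {"lines": lines, "sets": sets,
            "offset_bits": offset_bits, "index_bits": index_bits}

print(cache_geometry(32 * 1024, 64, 4))
# {'lines': 512, 'sets': 128, 'offset_bits': 6, 'index_bits': 7}
```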

Page 20: Cache Optimization for Mobile Devices Running Multimedia Applications

Associativity:
Better performance by increasing associativity for small caches
Going from direct-mapped to 2-way may reduce memory traffic by 50% for small cache sizes
Associativity greater than 4 shows minimal benefit across all cache sizes

Multilevel Caches:
A CL2 cache between CL1 and main memory significantly improves CPU performance
Adding CL2 decreases bus traffic and latency

Architecture

Page 21: Cache Optimization for Mobile Devices Running Multimedia Applications

[Diagram: Simulated Architecture]

Architecture

Page 22: Cache Optimization for Mobile Devices Running Multimedia Applications

The DSP decodes the encoded video stream
CL1 is a split cache with D1 and I1
CL2 is a unified cache
DSP and main memory are connected via a shared bus
DMA I/O transfers & buffers data from storage to main memory
The DSP decodes and writes video streams to main memory
The CPU reads and writes into main memory through its cache hierarchy

Architecture

Page 23: Cache Optimization for Mobile Devices Running Multimedia Applications

SIMULATION

Page 24: Cache Optimization for Mobile Devices Running Multimedia Applications

Simulation Tools
Cachegrind – from Valgrind
◦ It is a 'cache profiler' simulation package
◦ Performs detailed simulation of the D1, I1 and CL2 caches
◦ Gives the total references, misses, and miss rates
◦ It is useful for programs written in any language
(An example invocation follows this slide.)

VisualSim
◦ Provides block libraries for CPU, caches, bus, and main memory
◦ A simulation model is developed by selecting appropriate blocks and making connections
◦ Has functionality to run the model and collect results

Simulation
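A hedged example of how Cachegrind data like this could be collected. The cache-configuration flags (--I1/--D1/--LL=size,associativity,line size) are standard Cachegrind options; the decoder binary and input file names are placeholders, not artifacts from this work.

```python
# Run Cachegrind on a decoder binary to collect D1/I1/LL reference and miss counts.
import subprocess

cmd = [
    "valgrind", "--tool=cachegrind",
    "--I1=32768,4,64",        # 32 KB instruction cache, 4-way, 64 B lines
    "--D1=32768,4,64",        # 32 KB data cache, 4-way, 64 B lines
    "--LL=2097152,16,64",     # 2 MB last-level cache
    "./mpeg4_decoder", "input.m4v",   # placeholder decoder and bitstream
]
subprocess.run(cmd, check=True)       # summary miss rates are printed on exit
```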

Page 25: Cache Optimization for Mobile Devices Running Multimedia Applications

MPEG-4 Workload
The workload defines all possible operating scenarios and environmental conditions
Quality of the workload is important for simulation accuracy and completeness
In the simulation, the D1, I1 and CL2 hit ratios are used to model the system
This data is obtained from Cachegrind and used by the VisualSim simulation model

Simulation

Page 26: Cache Optimization for Mobile Devices Running Multimedia Applications

Level 1 Data and Instruction References

  Cache sizes (D1/I1/CL2, KB)   Line (B)   D1 refs (K) total/miss   I1 refs (K) total/miss   CL1 refs (D1 % / I1 %)
  8 / 8 / 128                   16         18782 / 521              38758 / 512              33 / 67
  16 / 16 / 512                 32         18782 / 430              38758 / 106              33 / 67
  32 / 32 / 2048                64         18782 / 403              38758 / 39               33 / 67

Different combinations of D1, I1 and CL2 are used
About 33% of references are data and 67% are instructions
As cache size & line size increase, the miss rate decreases

Simulation

Page 27: Cache Optimization for Mobile Devices Running Multimedia Applications

D1, I1 and CL2 hit ratios

  Cache sizes (D1/I1/CL2, KB)   Line (B)   CL1 hits, D1 %   CL1 hits, I1 %   CL2 hits %
  8 / 8 / 128                   16         95.0             98.0             99.3
  16 / 16 / 512                 32         96.4             98.6             99.9
  32 / 32 / 2048                64         98.0             99.5             100

Calculated hit rates for various sizes of the CL1 and CL2 caches
As cache size increases, the hit rate increases

Simulation

Page 28: Cache Optimization for Mobile Devices Running Multimedia Applications

Read and Write References

  CL2 size (KB)   D1 reads (K)   D1 writes (K)   Reads %   Writes %
  32              12391          6391            67        33
  128             12391          6391            67        33
  512             12391          6391            67        33
  2048            12391          6391            67        33

About 67% of references are reads and about 33% of references are writes

Simulation

Page 29: Cache Optimization for Mobile Devices Running Multimedia Applications

Input Parameters

  Item                 Value
  CL1 cache sizes      8+8 to 32+32 KB
  CL2 cache sizes      32 to 4096 KB
  Line size            16 to 256 B
  Associativity        2-way to 16-way
  Cache levels         L1 and L2
  Simulation time      2000.0 simulation time units
  Task time            1.0 simulation time unit
  Task rate            Task Time * 0.4
  CPU time             Task Time * 0.4
  Mem time             Task Time * 0.4
  Bus time             Mem Time * 0.4
  CL1 cache time       Mem Time * 0.2
  CL2 cache time       Mem Time * 0.4
  Main memory time     Task Time
  Bus queue length     300

(The derived timing values are worked out in the sketch after this slide.)

Simulation
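Worked arithmetic for the derived timing parameters in the table above, following the multipliers as listed (all values in simulation time units).

```python
# Evaluate the multiplicative timing parameters from the input-parameter table.
task_time = 1.0
mem_time = task_time * 0.4
timings = {
    "task_rate":        task_time * 0.4,   # 0.4
    "cpu_time":         task_time * 0.4,   # 0.4
    "mem_time":         mem_time,          # 0.4
    "bus_time":         mem_time * 0.4,    # 0.16
    "cl1_cache_time":   mem_time * 0.2,    # 0.08
    "cl2_cache_time":   mem_time * 0.4,    # 0.16
    "main_memory_time": task_time,         # 1.0
}
print(timings)
```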

Page 30: Cache Optimization for Mobile Devices Running Multimedia Applications

Assumptions
The dedicated bus between CL1 and CL2 introduces negligible delay compared to the bus connecting CL2 and memory
A write-back update policy is implemented, so the CPU is released immediately after CL1 is updated
Task time has been divided proportionally among the CPU, main memory, bus, and the L1 and L2 caches

Simulation

Page 31: Cache Optimization for Mobile Devices Running Multimedia Applications

Performance Metrics
2 performance metrics:
Utilization
◦ CPU utilization is the ratio of the time the CPU spent computing to the time the CPU spent transferring bits and performing un-tarring and tarring functions
Transactions
◦ The total number of transactions performed is the total number of tasks performed by a component during the simulation

Simulation

Page 32: Cache Optimization for Mobile Devices Running Multimedia Applications

RESULTS

Page 33: Cache Optimization for Mobile Devices Running Multimedia Applications

Miss rate variation due to changing the CL1 size while keeping the CL2 size constant

Not much benefit from using a CL1 larger than 8+8 KB

Results

Page 34: Cache Optimization for Mobile Devices Running Multimedia Applications

Effect on miss rate of changing the CL2 cache size:
From 32KB to 512KB the miss rate decreases slowly
From 512KB to 2MB the miss rate decreases sharply
From 2MB to 4MB the miss rate is almost unchanged

From a cost, space and complexity standpoint, a larger CL2 does not provide significant benefits

Results

Page 35: Cache Optimization for Mobile Devices Running Multimedia Applications

For a smaller cache such as D1, the miss rate starts decreasing (hit rate increasing) with an increase in line size

Miss rates start increasing after a point called the 'cache pollution point'
From 16 to 64B, a larger line size gives better spatial locality
From 128B onward there is no improvement, since on a miss more data has to be read and written back

Results

Page 36: Cache Optimization for Mobile Devices Running Multimedia Applications

Miss rate decreases significantly when going from 2-way to 4-way

No significant improvement for 8-way and higher

Results

Page 37: Cache Optimization for Mobile Devices Running Multimedia Applications

Total Transactions for different CL2 Sizes

  Component   32K    128K   256K   512K   1M     2M
  CPU         10K    10K    10K    10K    10K    10K
  CL1         10K    10K    10K    10K    10K    10K
  CL2         303    303    303    303    303    303
  Bus         3      3      2      2      1      0
  MM          3      3      2      2      1      0

CL1: 8+8 KB size, 16B line size, 4-way set associativity
CL2 size varied from 32KB to 4MB
CPU utilization and transactions collected

Results

Page 38: Cache Optimization for Mobile Devices Running Multimedia Applications

Memory requests initiated by the CPU are referred to CL1, then to CL2, and finally unsuccessful requests go to main memory
MM transactions decrease with an increase in CL2 size
All tasks initiated at the CPU are referred to CL1
Considering 10000 tasks: 3333 data and 6667 instructions
For a D1 miss ratio of 5% and an I1 miss ratio of 2%
◦ 168 + 135 = 303 tasks go to CL2
For a 32KB CL2 with a miss ratio of 0.9%
◦ Only 3 tasks go to MM
For a CL2 of 2MB and larger, with a miss ratio of 0%
◦ No tasks go to MM
(A sketch of this arithmetic follows this slide.)

Results
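A rough reconstruction of the task-flow arithmetic above (my own rounding; the slide's 168 + 135 = 303 differs slightly, presumably because of the one-decimal hit ratios quoted there).

```python
# Follow 10,000 CPU tasks down the cache hierarchy using the reported miss ratios.
tasks = 10_000
data, instr = 3_333, 6_667          # ~33% data, ~67% instruction references
d1_miss, i1_miss = 0.05, 0.02       # D1 hit 95%, I1 hit 98%
to_cl2 = round(data * d1_miss) + round(instr * i1_miss)   # ≈ 300 tasks reach CL2
to_mm_32k = round(to_cl2 * 0.009)                         # CL2 32 KB: ≈ 3 tasks reach MM
to_mm_2m = round(to_cl2 * 0.0)                            # CL2 ≥ 2 MB: 0 tasks reach MM
print(to_cl2, to_mm_32k, to_mm_2m)   # 300 3 0
```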

Page 39: Cache Optimization for Mobile Devices Running Multimedia Applications

CPU utilization decreases with an increase in CL2 size
Between 512KB and 2MB the decrease is significant
For 128KB and smaller, or 4MB and bigger, the change is not significant

Results

Page 40: Cache Optimization for Mobile Devices Running Multimedia Applications

CONCLUSION
Focused on enhancing MPEG-4 decoding using cache optimization for mobile devices
Used the Cachegrind and VisualSim simulation tools
Optimized cache size, line size, associativity and cache levels
The simulated architecture includes a two-level cache hierarchy
Collected references from Cachegrind to drive the VisualSim simulation model

Future Scope: Improve system performance further by using techniques like Selective Caching, Cache Locking, Scratch Memory, Data Recording

Page 41: Cache Optimization for Mobile Devices Running Multimedia Applications

QUESTIONS