Algorithms and Data Structures for Unobtrusive Real-time Compression of Instruction and Data Address Traces

Milena Milenković†, Aleksandar Milenković‡, Martin Burtscher¥

† WBI Performance, IBM Austin
‡ Electrical and Computer Engineering Department, The University of Alabama in Huntsville
¥ Computer Systems Laboratory, Cornell University

Email: [email protected]
Web: http://www.ece.uah.edu/~milenka
http://www.ece.uah.edu/~lacasa

2

Outline

Program Execution Traces: An Introduction
Problems and Existing Solutions
Trace Compression in Hardware
Instruction Address Trace Compression
Data Address Trace Compression
Results
Conclusions

3

Program Execution Traces: An Introduction

Streams of recorded events
  Basic block traces
  Address traces
  Instruction words
  Operands
  ...

Trace uses
  Computer architects for evaluation of new architectures
  Computer analysts for workload characterization
  Software developers for program tuning, optimization, and debugging

Trace issues
  Trace collection
  Trace reduction
  Trace processing

4

Program Execution Traces: An Introduction

Source loop:

    for(i=0; i<100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }

Compiled loop body (ARM):

    @ 0x020001f4: mov r1,r12, lsl #2
    @ 0x020001f8: ldr r2,[r4, r1]
    @ 0x020001fc: ldr r3,[r14, r1]
    @ 0x02000200: mla r0,r2,r8,r3
    @ 0x02000204: add r12,r12,#1 (1 >>> 0)
    @ 0x02000208: cmp r12,#99 (99 >>> 0)
    @ 0x0200020c: add r6,r6,r0
    @ 0x02000210: str r0,[r5, r1]
    @ 0x02000214: ble 0x20001f4

Dinero+ Execution Trace (columns: Type, Instruction Address, Data Address):

    2 0x020001f4
    0 0x020001f8 0xbfffbe24
    0 0x020001fc 0xbfffbc94
    2 0x02000200
    2 0x02000204
    2 0x02000208
    2 0x0200020c
    1 0x02000210 0xbfffbb04
    2 0x02000214
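A minimal sketch of how the Dinero+ records on this slide could be split into an instruction address trace and a data address trace. The type codes (0 = data read, 1 = data write, 2 = instruction fetch only) are an assumption based on the classic Dinero convention; they agree with the listing, where the two `ldr` instructions carry type 0 and the `str` carries type 1. The names `Record` and `parse_record` are hypothetical.

```python
from typing import NamedTuple, Optional

# Assumed type codes (classic Dinero convention, consistent with the slide).
READ, WRITE, IFETCH = 0, 1, 2


class Record(NamedTuple):
    kind: int                 # 0, 1, or 2
    pc: int                   # instruction address
    data_addr: Optional[int]  # present only for loads and stores


def parse_record(line: str) -> Record:
    """Parse one 'type pc [data_address]' line of the trace."""
    fields = line.split()
    kind = int(fields[0])
    pc = int(fields[1], 16)
    data_addr = int(fields[2], 16) if len(fields) > 2 else None
    return Record(kind, pc, data_addr)


# One loop iteration, copied from the slide.
TRACE = """\
2 0x020001f4
0 0x020001f8 0xbfffbe24
0 0x020001fc 0xbfffbc94
2 0x02000200
2 0x02000204
2 0x02000208
2 0x0200020c
1 0x02000210 0xbfffbb04
2 0x02000214
"""

records = [parse_record(line) for line in TRACE.splitlines()]
instr_addrs = [r.pc for r in records]
data_addrs = [r.data_addr for r in records if r.data_addr is not None]
```

Every record contributes an instruction address, while only the three memory-referencing instructions contribute a data address; these are the two streams the compressor treats separately.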


6

Problems

Problem #1: traces are very large
  In terabytes for a minute of program execution
  Expensive to store, transfer, and use
  Multiple cores on a single chip + more detailed information needed (e.g., time stamps of events)
  => Need trace compression

Problem #2: debugging is far from fun
  Stop execution on breakpoints, examine the state
  Time-consuming, difficult, may miss a critical state leading to erroneous behavior
  Stopping the CPU may perturb the sequence of events, making your bugs disappear
  => Need an unobtrusive tracing mechanism

7

Existing Trace Compression Techniques

Effective reduction techniques: lossless, high compression ratio, fast compression/decompression

General-purpose compression algorithms
  Ziv-Lempel (gzip)
  Burrows-Wheeler transformation (bzip2)
  Sequitur

Trace-specific compression techniques (VPC/TCGEN, SBC, LBTC, Mache, PDATS)
  Tuned to exploit redundancy in traces
  Better compression, faster, can be further combined with general-purpose compression algorithms

Problem: they target software implementations, but we need real-time, unobtrusive trace compression



9

Trace Compression in Hardware

Goals
  Small on-chip area and small number of pins
  Real-time compression (never stall the processor)
  Achieve a good compression ratio

Solution
  A set of compression algorithms targeting on-the-fly compression of instruction and data address traces

10

Trace Compressor: System Overview

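[Figure: Trace compressor system overview. The System Under Test contains the Processor Core, Memory, and the Trace Compressor; a trace port connects it to an External Trace Unit for storing/processing (a PC or an intelligent drive). Labels inside the Trace Compressor: Program Counter, Data Address, and Task Switch signals from the Processor Core; Stream Cache (SC) with the SCIT and SCMT traces; 2nd Level Compressor; Data Address Buffer; DAPC; Data Address Stride Cache (DASC) with the DT and DMT traces; Data Repetitions; Trace Output Controller, "To External Unit".]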



12

Instruction Address Trace Compression

Detect instruction streams
  Def.: An instruction stream is defined as a sequential run of instructions, from the target of a taken branch to the first taken branch in the sequence
  Our previous study showed that the number of unique streams in an application is fairly limited (ACM TOMACS’07)
  The average number of instructions in an instruction stream is 12 for SPEC CPU2000 integer applications and 117 for SPEC CPU2000 floating-point applications (ACM TOMACS’07)
  (S.SA, S.L) uniquely identify an instruction stream

Compress an instruction stream by replacing it with the corresponding stream cache index

13

Stream Detector + Stream Cache

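[Figure: Stream detector and stream cache. Stream detector: the current PC and the previous PC (PPC) are subtracted and the difference is compared with 4 (!= 4); the SA and SL registers feed the Instruction Stream Buffer, whose entries hold (SA, L). Stream Cache (SC): a record S.SA & S.L from the Instruction Stream Buffer selects a set iSet = F(S.SA, S.SL) among sets 0 .. NSET - 1; the ways 0 .. NWAY - 1 of that set are compared (=?) to produce Hit/Miss and iWay. Entry 0 is reserved. Outputs: Stream Cache Index Trace (SCIT), which carries either the index or the reserved value '00…0', and Stream Cache Miss Trace (SCMT), which carries (SA, SL).]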

14

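Instruction Trace Compression: An Analytical Model

Legend:
  CR(SC.I): compression ratio for the instruction component
  Itrace: instruction address trace
  SL.Dyn: average stream length (dynamic)
  N: number of instructions
  SC.Hit(Nset, Nway): stream cache hit rate with Nset × Nway entries

Stream cache has Nset × Nway entries => log2(Nset × Nway) bits for SCIT components

$$\mathrm{CR(SC.I)} = \frac{\mathrm{Size(Itrace)}}{\mathrm{Size(SCIT)} + \mathrm{Size(SCMT)}}$$

$$\mathrm{Size(Itrace)} = 4 \cdot N \ \text{Bytes}$$

$$\mathrm{Size(SCIT)} = \frac{N}{\mathrm{SL.Dyn}} \cdot \frac{\log_2(N_{SET} \cdot N_{WAYS})}{8} \ \text{Bytes}$$

$$\mathrm{Size(SCMT)} = \frac{N}{\mathrm{SL.Dyn}} \cdot (1 - \mathrm{SC.Hit}) \cdot 5 \ \text{Bytes}$$

$$\mathrm{CR(SC.I)} = 4 \cdot \mathrm{SL.Dyn} \cdot \frac{1}{\frac{1}{8}\log_2(N_{SET} \cdot N_{WAYS}) + 5 \cdot (1 - \mathrm{SC.Hit})}$$

Limit for a perfect hit rate:

$$\lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = \frac{32 \cdot \mathrm{SL.Dyn}}{\log_2(N_{SET} \cdot N_{WAYS})}$$

$$N_{SET} \cdot N_{WAYS} = 64: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = 5.34 \cdot \mathrm{SL.Dyn}$$

$$N_{SET} \cdot N_{WAYS} = 128: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = 4.57 \cdot \mathrm{SL.Dyn}$$

$$N_{SET} \cdot N_{WAYS} = 256: \quad \lim_{\mathrm{SC.Hit} \to 1} \mathrm{CR(SC.I)} = 4 \cdot \mathrm{SL.Dyn}$$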
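The analytical model on this slide can be evaluated numerically. The sketch below assumes the per-stream costs stated in the model: 4 bytes per instruction in the raw trace, log2(Nset × Nway) bits per stream in SCIT, and a 5-byte descriptor in SCMT on every stream cache miss. The function names are hypothetical.

```python
import math


def cr_sc_i(sl_dyn: float, sc_hit: float, n_set: int, n_way: int) -> float:
    """Modelled CR(SC.I) = Size(Itrace) / (Size(SCIT) + Size(SCMT)).

    All three sizes are divided by the number of streams N / SL.Dyn,
    so N itself cancels out.
    """
    raw_bytes = 4.0 * sl_dyn                     # Itrace bytes per stream
    index_bytes = math.log2(n_set * n_way) / 8   # SCIT bytes per stream
    miss_bytes = 5.0 * (1.0 - sc_hit)            # SCMT bytes per stream
    return raw_bytes / (index_bytes + miss_bytes)


def cr_sc_i_limit(sl_dyn: float, n_set: int, n_way: int) -> float:
    """Upper bound reached when SC.Hit approaches 1."""
    return 32.0 * sl_dyn / math.log2(n_set * n_way)


if __name__ == "__main__":
    # Illustrative inputs only: a 32x4 stream cache and a 97% hit rate.
    for sl in (5.0, 12.0, 25.0):
        print(sl, round(cr_sc_i(sl, 0.97, 32, 4), 2),
              round(cr_sc_i_limit(sl, 32, 4), 2))
```

The model makes the trade-off visible: a larger stream cache raises the hit rate but also widens every SCIT index, so the bound per unit of SL.Dyn drops from 32/6 for 64 entries to 32/8 for 256 entries.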

15

2nd Level Instruction Address Trace Compression

Observation: a small number of streams exhibit a very strong temporal locality

Consequences
  High stream cache hit rates => Size(SCIT) >> Size(SCMT)
  A lot of redundancy in the SCIT stream

How could we exploit this?
  N-tuple Compression Using N-Tuple History Table

16

N-tuple Compression Using Tuple History Table

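[Figure: N-tuple compression. The SCIT trace fills the N-tuple Input Buffer; the buffered tuple is compared (==?) against the entries of the N-tuple History Table (a FIFO with indices 1 .. MaxT-1) to produce Hit/Miss. Outputs: the TUPLE.HIT trace, which carries either the matching index or the reserved value '00…0', and the TUPLE.MISS trace.]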



18

Data Address Trace Compression

More challenging task
  Data addresses rarely stay constant during program execution
  However, they often have a regular stride

Proposed approach exploits locality of memory referencing instructions and regularity in data address strides

Use a new structure: Data Address Stride Cache (DASC)

19

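Data Address Stride Cache (DASC)

[Figure: Tagless Data Address Stride Cache. The PC selects an entry through index = G(PC); each of the entries 0 .. N - 1 holds an LDA field and a Stride field. The difference between the incoming data address (DA) and the stored LDA is compared (==?) with the stored Stride to produce Stride.Hit. Outputs: DT (Data Trace), which carries '1' or '0', and DMT (Data Miss Trace).]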

20

Tagless DASC Compression Ratio: An Analytical Model

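$$\mathrm{CR(SC.D)} = \frac{\mathrm{Size(Dtrace)}}{\mathrm{Size(DT)} + \mathrm{Size(DMT)}}$$

$$\mathrm{Size(Dtrace)} = N_{memref} \cdot 4 \ \text{B}$$

$$\mathrm{Size(DT)} + \mathrm{Size(DMT)} = N_{memref} \cdot [(1 - \mathrm{DASC.AddressHit}) \cdot 4 + 0.125] \ \text{B}$$

$$\mathrm{CR(SC.D)} = \frac{1}{1.03125 - \mathrm{DASC.AddressHit}}$$

$$\lim_{\mathrm{DASC.AddressHit} \to 1} \mathrm{CR(SC.D)} = \frac{1}{0.03125} = 32$$

Legend:
  CR(SC.D): compression ratio for the data address trace
  Dtrace: data address trace
  Nmemref: number of memory referencing instructions
  DASC.AddressHit: hit rate in the data address stride cache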
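As a quick numerical check of the DASC model, the sketch below evaluates the compression ratio for a given hit rate. It assumes the costs stated on the slide: 4 bytes per raw data address, one bit (0.125 B) in DT for every memory reference, and 4 more bytes in DMT for every miss. The function name is hypothetical.

```python
def cr_sc_d(address_hit: float) -> float:
    """Modelled CR(SC.D) = Size(Dtrace) / (Size(DT) + Size(DMT)).

    Sizes are expressed per memory reference, so Nmemref cancels out.
    """
    raw_bytes = 4.0                          # one 32-bit address
    dt_bytes = 0.125                         # one hit/miss bit
    dmt_bytes = 4.0 * (1.0 - address_hit)    # full address on a miss
    return raw_bytes / (dt_bytes + dmt_bytes)


if __name__ == "__main__":
    # Illustrative hit rates only.
    for hit in (0.0, 0.5, 0.9, 0.99, 1.0):
        print(hit, round(cr_sc_d(hit), 3))
```

With a hit rate of 0 the ratio falls slightly below 1, because the DT bit is paid on top of every full address; the bound of 32 is reached only when every reference hits.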

21

2nd Level Data Address Trace Comp.

[Figure: The incoming DT byte is compared (=?) with Prev.DT; a counter CNT tracks repetitions. Outputs: Data Header (DH) and Data Repetition Trace (DRT).]

// Detect data repetitions
Get next DT byte;
if (DT == Prev.DT) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.DT to DRT;
        Emit '0' to DH;
    } else {
        Emit (Prev.DT, CNT) pair to DRT;
        Emit '1' to DH;
        CNT = 0;    // restart the repetition count for the next run
    }
    Prev.DT = DT;
}
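The repetition detector can be sketched in software as a run-length step over DT bytes. This is an illustration under stated assumptions rather than the authors' hardware: the counter is reset after each emitted pair, the last run is flushed at the end of the input, and the counter saturates at a hypothetical `max_cnt` of 255. The function names are hypothetical as well.

```python
def compress_repetitions(dt_bytes, max_cnt=255):
    """Return (dh, drt): header bits and the data repetition trace."""
    dh, drt = [], []
    prev, cnt = None, 0
    # The trailing None is a sentinel that flushes the last run (assumption).
    for b in list(dt_bytes) + [None]:
        if prev is not None and b == prev and cnt < max_cnt:
            cnt += 1                      # same byte again: count it
            continue
        if prev is not None:
            if cnt == 0:
                drt.append(prev)          # single byte
                dh.append(0)
            else:
                drt.extend([prev, cnt])   # (byte, repetition count) pair
                dh.append(1)
        prev, cnt = b, 0                  # start a new run
    return dh, drt


def expand_repetitions(dh, drt):
    """Inverse of compress_repetitions."""
    out, i = [], 0
    for flag in dh:
        if flag:
            out.extend([drt[i]] * (drt[i + 1] + 1))
            i += 2
        else:
            out.append(drt[i])
            i += 1
    return out


if __name__ == "__main__":
    dt = [0xFF, 0xFF, 0xFF, 0x3C, 0xF0, 0xF0]
    dh, drt = compress_repetitions(dt)
    print(dh, drt)
    print(expand_repetitions(dh, drt) == dt)
```

Long runs of identical DT bytes arise when many consecutive references hit in the DASC, which is the case this second-level step targets.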



23

Experimental Evaluation

Goals
  Assess the effectiveness of the proposed algorithms
  Explore the feasibility of the proposed hardware implementations

Workload
  16 MiBench benchmarks
  ARM architecture

Benchmark     IC             NUS   maxSL  SL.Dyn
cjpeg         104,607,812    1636  239    10.89
djpeg         23,391,628     1324  206    21.81
lame          1,285,111,635  3410  252    27.81
tiff2bw       143,254,646    1058  43     12.79
tiff2rgba     151,691,275    1146  75     27.54
tiffmedian    541,260,067    1431  75     22.22
tiffdither    832,951,018    1831  51     12.57
mad           286,974,899    1659  1055   20.09
sha           140,885,982    495   62     15.15
bf_e          544,053,846    413   300    5.85
rijndael_e    319,977,971    542   254    18.94
ghostscript   708,090,638    6900  187    8.70
rsynth        824,942,227    1323  180    15.77
stringsearch  3,675,745      439   62     5.61
adpcm_c       732,513,651    347   71     54.63
gsm_d         1,299,270,245  845   401    11.07

24

Findings about SC Size/Organization

SC with 128 entries
  CR(32x4) = 54.139, CR(16x8) = 57.427
SC with 256 entries
  CR(64x4) = 53.6
But even smaller SCs work very well
  64 entries: CR(8x8) = 47.068, CR(16x4) = 44.116
  16 entries: CR(8x2) = 22.145
Associativity
  Higher is better for very small SCs (direct mapped is not an option)
  Less important for larger SCs

SC.Hit (%), by number of entries (rows) and ways (columns)

Entries  1      2      4      8
8        55.47  59.67  61.06  59.54
16       67.35  71.22  74.58  73.60
32       73.99  79.51  82.45  82.82
64       80.75  88.28  91.44  93.08
128      84.62  94.27  97.26  98.33
256      85.98  97.05  99.08  99.08

CR(SC.I), by number of entries (rows) and ways (columns)

Entries  1      2      4      8
8        16.33  17.59  16.99  15.79
16       21.10  22.15  27.81  26.61
32       23.88  28.02  34.40  33.96
64       27.54  36.89  44.12  47.07
128      28.95  47.57  54.14  57.43
256      28.05  47.81  53.60  54.24

25

SC + N-tuple Compression Ratio

Benchmark     SC.I     SC.I+NT  I.GZ     I.GZ     I.GZ     I.BZ2    I.GZGZ
                                (DEF.)   (FAST)   (BEST)   (BEST)   (DEF.)
cjpeg         47.98    147.56   109.58   54.53    124.45   341.96   265.66
djpeg         87.35    188.53   71.78    39.85    73.70    201.98   232.47
lame          100.68   158.10   60.46    128.53   333.88   87.61    174.24
tiff2bw       54.91    235.05   114.11   83.94    114.42   376.83   615.21
tiff2rgba     117.53   407.14   121.30   20.26    121.98   529.62   1292.74
tiffmedian    95.91    414.37   152.81   92.32    155.47   472.93   1017.48
tiffdither    43.45    65.48    91.09    46.35    99.84    170.88   147.09
mad           81.52    177.84   73.46    37.82    78.52    94.31    206.25
sha           69.24    440.35   211.43   54.42    221.75   656.53   4112.08
bf_e          25.57    98.46    170.38   40.95    182.25   352.02   4065.94
rijndael_e    85.17    454.63   143.82   12.56    150.62   141.77   2392.86
ghostscript   26.57    50.91    100.64   39.68    111.24   212.54   434.53
rsynth        56.42    91.83    46.71    30.61    48.02    143.22   191.22
stringsearch  16.92    24.22    82.06    32.34    100.63   202.47   132.76
adpcm_c       249.71   1583.96  233.12   107.34   233.63   1862.63  12764.68
gsm_d         46.79    174.57   85.37    59.22    87.17    165.58   507.06
TOTAL         54.14    125.90   87.45    47.24    112.91   171.97   321.58

26

DASC Compression Ratio

Benchmark     DASC    DASC    DASC    DASC    DASC    DASC    D.GZ    D.GZ    D.GZ    D.BZ2   D.GZGZ
              32      64      128     256     512     1024    (DEF.)  (FAST)  (BEST)
cjpeg         3.35    4.60    5.14    5.77    6.54    7.11    5.98    4.50    6.11    18.20   9.57
djpeg         2.81    3.57    4.28    4.96    5.22    5.29    4.22    3.78    4.22    8.62    4.92
lame          1.20    1.52    2.81    3.82    4.49    4.88    6.56    4.01    6.63    8.80    8.60
tiff2bw       76.31   78.04   84.28   105.04  128.84  134.23  2.14    2.55    2.10    14.28   3.07
tiff2rgba     5.98    79.81   91.24   107.49  127.05  139.57  2.10    2.79    2.09    4.06    4.03
tiffmedian    8.64    8.70    8.74    8.81    8.87    8.89    4.40    4.37    4.53    11.16   6.03
tiffdither    2.61    6.08    7.21    8.69    9.65    10.06   4.51    4.41    4.51    7.87    6.77
mad           1.30    1.59    1.96    2.07    2.35    2.64    4.08    3.60    4.22    13.47   6.97
sha           6.58    7.94    9.38    10.79   11.36   11.36   44.91   8.36    45.61   172.71  591.69
bf_e          1.58    1.95    2.38    2.61    2.75    2.91    7.58    4.86    7.83    16.35   9.08
rijndael_e    1.10    1.10    1.10    1.13    1.29    2.06    4.24    3.22    4.27    7.31    4.49
ghostscript   1.07    1.19    1.56    2.19    2.93    5.27    27.21   18.58   27.46   47.42   40.83
rsynth        1.22    1.36    1.76    3.81    8.30    32.43   24.44   21.46   25.27   57.40   43.88
stringsearch  1.80    2.04    2.70    4.13    4.44    5.16    11.12   8.57    11.23   15.03   11.47
adpcm_c       3.13    3.13    3.13    3.13    3.13    3.13    6.57    3.64    7.15    12.27   11.42
gsm_d         2.67    4.48    11.30   13.60   14.81   16.78   21.60   18.05   23.29   63.53   33.15
TOTAL         1.66    2.04    2.80    3.77    4.67    6.12    6.78    5.51    6.90    13.29   9.70

27

Hardware Complexity Estimation

CPU model
  In-order, XScale-like
  Vary SC and DASC parameters

SC and DASC timings
  SC: Hit latency = 1 cc, Miss latency = 2 cc
  DASC: Hit latency = 2 cc, Miss latency = 2 cc

To avoid any stalls
  Instruction stream input buffer: MIN = 2 entries
  Data address input buffer: MIN = 8 entries
  Results are relatively independent from SC and DASC organization

Component                       Entries  Complexity     Bytes
Instruction stream buffer       2        2x5            10
Stream detector                 2        2x4            8
Stream cache                    32x4     128x5          640
N-tuple history buffer          255      255x8*(7/8)    1785
Data address buffer             8        8x8            64
Data address stride cache       1024     1024x5         5120
Data repetitions state machine  -        2              2



29

Conclusions

Contribution: a set of algorithms for instruction and data address trace compression
  Enabling real-time trace compression
  Low complexity (small structures, small number of external pins)
  Excellent compression ratio

Proposed mechanism
  Stream Caches + N-tuple for instruction address traces
  Data Address Stride Cache + Data Repetitions for data address traces

Analytical & simulation analysis focusing on
  Compression ratio (bits/instruction)
  Optimal sizing/organization of the structures

Findings
  The proposed mechanism outperforms the FAST GZ software implementation with relatively small structures (32x4 SC, 1024x1 DASC)
  Performs as well as the DEFAULT GZ software implementation when N-tuple and Data repetitions are included

Appendix

31

Detect and Compress an Instruction Stream

Detect a new instruction stream
    Get next PC;
    ndiff = PC - PPC;
    if (ndiff != 4 or SL == MaxS) {
        Place (SA & SL) into the instruction stream buffer;
        SL = 1;
        SA = PC;
    } else SL++;
    PPC = PC;

Compress instruction stream
    Get the next instruction stream record from the instruction stream buffer (S.SA, S.SL);
    Lookup in the stream cache with iSet = F(S.SA, S.SL);
    if (hit)
        Emit (iSet & iWay) to SCIT;
    else {
        Emit reserved value 0 to SCIT;
        Emit stream descriptor (S.SA, S.SL) to SCMT;
        Select an entry (iWay) in the iSet set to be replaced;
        Update stream cache entry: SC[iSet][iWay].Valid = 1, SC[iSet][iWay].SA = S.SA, SC[iSet][iWay].SL = S.SL;
    }
    Update stream cache replacement indicators;
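The two procedures above can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' hardware design: the hash `F` is a hypothetical XOR of address and length bits, replacement is LRU, way 0 of set 0 is never allocated so that index 0 stays reserved for misses, and the last open stream is flushed when the input ends.

```python
def detect_streams(pcs, max_sl=255):
    """Split a sequence of PCs into (SA, SL) stream descriptors."""
    streams = []
    sa, sl, ppc = None, 0, None
    for pc in pcs:
        if ppc is None:
            sa, sl = pc, 1                 # very first instruction
        elif pc - ppc != 4 or sl == max_sl:
            streams.append((sa, sl))       # stream ended: buffer it
            sa, sl = pc, 1
        else:
            sl += 1
        ppc = pc
    if sa is not None:
        streams.append((sa, sl))           # flush the open stream (assumption)
    return streams


class StreamCache:
    def __init__(self, n_set=32, n_way=4):
        assert n_way >= 2, "sketch needs at least two ways"
        self.n_set, self.n_way = n_set, n_way
        self.entries = [[None] * n_way for _ in range(n_set)]
        # Per-set way order, least recently used first. Way 0 of set 0 is
        # left out so that index 0 never denotes a hit.
        self.lru = [list(range(1 if s == 0 else 0, n_way))
                    for s in range(n_set)]

    def _f(self, sa, sl):
        return ((sa >> 2) ^ sl) % self.n_set   # hypothetical hash F

    def compress(self, streams):
        """Return (scit, scmt) for a list of (SA, SL) descriptors."""
        scit, scmt = [], []
        for sa, sl in streams:
            i_set = self._f(sa, sl)
            ways = self.lru[i_set]
            hits = [w for w in ways if self.entries[i_set][w] == (sa, sl)]
            if hits:
                i_way = hits[0]
                scit.append(i_set * self.n_way + i_way)
            else:
                i_way = ways[0]                # replace the LRU way
                scit.append(0)                 # reserved miss marker
                scmt.append((sa, sl))
                self.entries[i_set][i_way] = (sa, sl)
            ways.remove(i_way)                 # mark as most recently used
            ways.append(i_way)
        return scit, scmt


if __name__ == "__main__":
    # Three iterations of the nine-instruction loop from the introduction.
    pcs = [pc for _ in range(3) for pc in range(0x020001f4, 0x02000218, 4)]
    streams = detect_streams(pcs)
    print(streams)
    print(StreamCache(32, 4).compress(streams))
```

For the loop example, the first occurrence of the stream misses and sends its descriptor to SCMT; each later iteration is represented by a single stream cache index in SCIT.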

32

N-tuple Compression Using Tuple History Table (THT)

Get the next SCIT;
if (N-tuple incoming stream buffer is full) {
    Lookup in the Tuple History Table (THT);
    if (hit) {
        Emit (index in the THT) to the Tuple.Hit trace;    // emit the first index found in the buffer
    } else {
        Emit (0) to Tuple.Hit trace;
        Emit (N-tuple) to Tuple.Miss trace;
    }
    Update the Tuple History Table;
}
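A minimal software sketch of the N-tuple step, under these assumptions: the history table is a FIFO holding up to MaxT - 1 tuples, every processed tuple is pushed into it (hit or miss), index 0 is reserved for misses, and SCIT values that do not fill a complete tuple at the end of the input are left in the buffer. The defaults `n=8` and `max_t=256` are illustrative, and the function names are hypothetical.

```python
from collections import deque


def compress_ntuples(scit, n=8, max_t=256):
    """Return (tuple_hit, tuple_miss) traces for a list of SCIT indices."""
    tht = deque(maxlen=max_t - 1)      # FIFO; position 0 holds the newest
    tuple_hit, tuple_miss = [], []
    for start in range(0, len(scit) - n + 1, n):
        tup = tuple(scit[start:start + n])
        if tup in tht:
            # First (most recent) match; +1 keeps index 0 reserved.
            tuple_hit.append(tht.index(tup) + 1)
        else:
            tuple_hit.append(0)
            tuple_miss.append(tup)
        tht.appendleft(tup)            # oldest entry drops off when full
    return tuple_hit, tuple_miss


def expand_ntuples(tuple_hit, tuple_miss, max_t=256):
    """Inverse of compress_ntuples; rebuilds the SCIT sequence."""
    tht = deque(maxlen=max_t - 1)
    misses = iter(tuple_miss)
    out = []
    for idx in tuple_hit:
        tup = next(misses) if idx == 0 else tht[idx - 1]
        out.extend(tup)
        tht.appendleft(tup)
    return out


if __name__ == "__main__":
    scit = [1, 2, 3, 4] * 3 + [5, 6, 7, 8] + [1, 2, 3, 4]
    hit, miss = compress_ntuples(scit, n=4)
    print(hit, miss)
    print(expand_ntuples(hit, miss) == scit)
```

Because the decompressor applies the same table updates in the same order, the emitted indices are sufficient to rebuild the SCIT stream without transmitting the table.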

33

Data Address Compression: Tagless DASC

// Compress data address stream
Get the next pair from data buffers (PC, DA);
Lookup in the data address stride cache: iSet = G(PC);
cStride = DA - DASC[iSet].LDA;
if (cStride == DASC[iSet].Stride) {
    Emit ('1') to DT;    // 1-bit info
} else {
    Emit ('0') to DT;
    Emit DA to DMT;
    DASC[iSet].Stride = lsb(cStride);
}
DASC[iSet].LDA = DA;
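The tagless DASC procedure can be sketched as follows. The index function `G` (low PC bits above the word offset), the 8-bit stride field, and the sign extension used for `lsb(cStride)` are assumptions made for this illustration; the class and helper names are hypothetical.

```python
def low_bits_signed(value, bits):
    """Keep the low `bits` bits of value and interpret them as signed."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class TaglessDASC:
    def __init__(self, n_entries=1024, stride_bits=8):
        self.n = n_entries
        self.stride_bits = stride_bits
        self.lda = [0] * n_entries       # last data address per entry
        self.stride = [0] * n_entries    # last stride per entry

    def _g(self, pc):
        return (pc >> 2) % self.n        # hypothetical index function G

    def compress(self, refs):
        """refs: iterable of (PC, DA) pairs. Returns (dt, dmt)."""
        dt, dmt = [], []
        for pc, da in refs:
            i = self._g(pc)
            c_stride = da - self.lda[i]
            if c_stride == self.stride[i]:
                dt.append(1)             # stride hit: one bit is enough
            else:
                dt.append(0)
                dmt.append(da)           # miss: emit the full address
                self.stride[i] = low_bits_signed(c_stride, self.stride_bits)
            self.lda[i] = da
        return dt, dmt


if __name__ == "__main__":
    # The load at 0x020001f8 walking an array with a 4-byte stride.
    refs = [(0x020001f8, 0xbfffbe24 + 4 * i) for i in range(5)]
    print(TaglessDASC().compress(refs))
```

In this example the first two references miss (the entry has to learn the last address and then the stride), and every later reference costs a single DT bit.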
