Catching Accurate Profiles in Hardware
Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese
Presented by Jelena Trajkovic, ICS 280/259
Slide 2
Outline
- Introduction & Motivation
- Goal
- Related Work (Stratified Sampler)
- Interval-based Profiling for a Single Hash Profiler
  - Experimental results
- Multiple-hash Profiler
  - Experimental results
Slide 3
Introduction & Motivation
- SW is used to gather program behavior information
- Architectural support for generating profiles at run-time
  - HW assists SW, but still depends on system SW (for management or aggregation of events)
- HW-only profiler
Slide 4
Introduction & Motivation (cont.)
HW optimizations that can take advantage of information gathered at run-time:
- Cache replacement & prefetching
  - identify the loads that cause the majority of misses
- Value-based optimization
  - 50% of memory accesses are dominated by 10 distinct values; can this be captured dynamically?
  - => this information is used for storing compressed values in the data cache
- Trace formation
  - dynamically extract and order frequently executed code => more efficient I-fetch
- Multiple-path execution
  - find branches that are hard to predict and execute down multiple paths
Slide 5
Goal
The goal is to build a profiling scheme that satisfies the following properties:
- Area efficient
  - capacity constraints (fixed amount of area)
- Accurate
  - identify important / frequent events and count them accurately
- Timely
  - up-to-date information about program behavior
- Performance efficient and SW independent
  - independent of system SW support for managing profiles (accumulating and analyzing events); identification is done in HW
Slide 6
Related Work
- SW profiling
  - Binary instrumentation (ATOM by Calder et al.)
- HW counter assisted profiling
  - DCPI system for Alpha processors
- HW table based profiling
  - Stratified sampling (Sastry et al.)
- Co-processor profiler
  - Distills information passed from the main processor (Zilles and Sohi)
Slide 7
Profiling Events
- Profiling event: a combination of several variables
  - instruction PC, load address, register value or name, cache miss
- Tuple: represents an event as a combination of 2 variables
Slide 8
Related Work: Stratified Sampler
- Divides the original input stream into multiple streams via hashing (each independently sampled)
- Table of counters
  - counters hold the number of occurrences of different events
  - a counter is selected by applying a hash function to the input event
  - it is incremented when the event appears in the input stream
  - on reaching the threshold value, the counter is reset and the event is reported (interrupt to the OS)
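The counter-table behavior described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the sampler of Sastry et al.: the class name, the table size, the threshold, and the use of Python's built-in `hash` as the hash function are all assumptions, and the `reported` list merely stands in for the interrupt to the OS.

```python
class StratifiedSampler:
    """Toy model: an untagged table of counters indexed by a hash of the event."""

    def __init__(self, num_counters=8, threshold=4):
        self.threshold = threshold
        self.counters = [0] * num_counters
        self.reported = []  # stand-in for "interrupt to the OS"

    def _index(self, event):
        # Placeholder hash (assumption); the slide does not specify one.
        return hash(event) % len(self.counters)

    def observe(self, event):
        i = self._index(event)
        self.counters[i] += 1
        if self.counters[i] >= self.threshold:
            # Threshold reached: reset the counter and report the event.
            self.counters[i] = 0
            self.reported.append(event)


sampler = StratifiedSampler(num_counters=8, threshold=4)
for _ in range(9):
    sampler.observe((0x400, 7))  # a hypothetical <pc, value> tuple
```

With a threshold of 4, nine occurrences of the same event produce two reports and leave one occurrence in the counter. Because the table has no tags, different events that hash to the same counter would be counted together, which is the aliasing problem addressed on the next slide.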
Slide 9
Related Work: Stratified Sampler (cont.)
- To reduce aliasing and improve accuracy: partial tags, miss counters, state information
  - Hit counters: number of occurrences
  - Miss counters: the tuple hashes to a particular entry, but the tag differs (replacement policy)
- On reaching the threshold value:
  - Generate an interrupt
  - Buffered: an interrupt is sent when the buffer fills up
  - Placed in an associative counter table, passed to SW (via an intermediate buffer)
- Accumulating information in SW (5% interrupt overhead)
Slide 10
Interval-based Profiling for a Single Hash Profiler
- Removing SW: accumulator table
- Interval-based
  - significant number of occurrences within an interval
  - reset the hash-table counters after every interval
  - improving accuracy: shielding
- Divide execution time into intervals
  - interval length: a fixed number of profiling events (tuples)
  - capture only events (candidate tuples) that occur more often than the candidate threshold (% of interval length)
Slide 11
Single Hash Architecture
- The accumulator table is fully associative and tagged
- if (input tuple is in acc. table)
    increment its counter
  else
    hash into the hash-table and increment the corresponding counter
  - the hash-table does not contain tags => aliasing
- if (tuple reaches the candidate threshold value)
    if (acc. table is not full)
      allocate an acc. table entry
      mark the entry as non-replaceable until the end of the interval
  - that tuple is no longer given as an input to the hash-table => shielding
- if (end of the interval)
    flush the hash-table
    mark all entries in the acc. table as replaceable
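The steps on this slide can be put together as a small software model. This is a sketch under stated assumptions rather than the hardware design: the class and parameter names are hypothetical, the index function `(pc ^ value) % size` is a placeholder, a promoted entry is assumed to start from the hash-table count that triggered promotion, and the accumulator table is simply emptied at the end of each interval (no retaining) instead of marking entries replaceable.

```python
class SingleHashProfiler:
    def __init__(self, interval_len=100, threshold_pct=10,
                 table_size=128, acc_capacity=10):
        self.interval_len = interval_len
        self.threshold = interval_len * threshold_pct // 100
        self.hash_table = [0] * table_size   # untagged counters (aliasing possible)
        self.acc_capacity = acc_capacity
        self.acc = {}        # tuple -> count; models the tagged, fully associative table
        self.seen = 0
        self.profiles = []   # one snapshot of the acc. table per finished interval

    def _index(self, tup):
        pc, value = tup
        return (pc ^ value) % len(self.hash_table)  # placeholder hash (assumption)

    def observe(self, tup):
        if tup in self.acc:
            self.acc[tup] += 1            # shielding: the hash-table is not touched
        else:
            i = self._index(tup)
            self.hash_table[i] += 1
            if (self.hash_table[i] >= self.threshold
                    and len(self.acc) < self.acc_capacity):
                # Assumption: the new entry starts from the hash-table count.
                self.acc[tup] = self.hash_table[i]
        self.seen += 1
        if self.seen == self.interval_len:
            self._end_interval()

    def _end_interval(self):
        self.profiles.append(dict(self.acc))
        self.hash_table = [0] * len(self.hash_table)  # flush the hash-table
        self.acc.clear()                              # simplification: no retaining
        self.seen = 0


profiler = SingleHashProfiler()
hot = (0x40, 0x3F)                      # hypothetical hot <pc, value> tuple
stream = [hot] * 30 + [(0x100 + k, 0) for k in range(70)]
for tup in stream:
    profiler.observe(tup)
```

In this run the hot tuple crosses the threshold of 10 on its tenth occurrence, moves to the accumulator table, and is counted there for the rest of the interval, while the 70 cold tuples never leave the hash-table.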
Slide 12
Single Hash Architecture (cont.)
- Calculate the worst-case number of entries in the acc. table (to avoid capacity and aliasing issues) as a function of profile interval length and candidate threshold
  - profile interval length: number of events that determine the profiling interval
  - candidate threshold: number of occurrences needed to get recorded in the acc. table (percentage of interval length)
- e.g. interval length = 10,000
  - candidate threshold = 1% => 100 entries
  - candidate threshold = 0.1% => 1,000 entries
- Configurations evaluated: 10,000 w/ 1% and 1 million w/ 0.1%
- Hash-table: 2K entries
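The worst-case sizing above is simple arithmetic: if a tuple needs at least a given number of occurrences to become a candidate, an interval can contain at most interval length divided by that number of candidates. A minimal sketch, assuming a tuple qualifies once it reaches the threshold count (the function name is hypothetical, and `Fraction` is used only to avoid floating-point rounding):

```python
from fractions import Fraction
from math import ceil


def worst_case_entries(interval_len, threshold):
    """Upper bound on candidate tuples in one interval.

    threshold is the candidate threshold as a fraction of the interval length.
    """
    min_count = ceil(interval_len * threshold)  # occurrences needed to qualify
    return interval_len // min_count


one_percent = Fraction(1, 100)
tenth_percent = Fraction(1, 1000)

small = worst_case_entries(10_000, one_percent)        # 100
medium = worst_case_entries(10_000, tenth_percent)     # 1000
large = worst_case_entries(1_000_000, tenth_percent)   # 1000
```

Note that the bound depends only on the threshold percentage: both the 10,000-event interval and the 1-million-event interval need at most 1,000 entries at 0.1%.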
Slide 13
Single Hash Architecture (cont.)
- Hash function, for a given tuple <pc, value>:
  - npc = flip(randomize(pc))
  - nv = randomize(value)
  - index = xor-fold(npc xor nv, index-size)
- Optimizations:
  - Retaining: keep the top entries in the acc. table from the previous interval
  - Resetting: reset a counter in the hash-table after it reaches the candidate threshold
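The hash construction above can be sketched as follows. The slide names `randomize`, `flip`, and `xor-fold` but does not define them, so the bodies below are assumptions: `randomize` is modeled as a multiplication by an odd constant modulo 2^32, `flip` as a reversal of the 32-bit pattern, and `xor_fold` as XOR-ing successive index-sized chunks together. Only the way the three are combined comes from the slide.

```python
MASK32 = 0xFFFFFFFF


def randomize(x):
    # Assumed mixer: multiply by an odd constant, keep the low 32 bits.
    return (x * 2654435761) & MASK32


def flip(x):
    # Assumed meaning: reverse the order of the 32 bits.
    return int(format(x & MASK32, "032b")[::-1], 2)


def xor_fold(x, index_bits):
    # XOR together successive index_bits-wide chunks of x.
    mask = (1 << index_bits) - 1
    folded = 0
    while x:
        folded ^= x & mask
        x >>= index_bits
    return folded


def table_index(pc, value, index_bits=11):
    # 11 index bits address a 2K-entry hash-table.
    npc = flip(randomize(pc))
    nv = randomize(value)
    return xor_fold(npc ^ nv, index_bits)


idx = table_index(0x120004A8, 42)  # hypothetical pc and value
```

Flipping only the PC side keeps the low-order bits of the two randomized fields from lining up before they are XOR-ed, which is one plausible reason for the asymmetry in the slide's formula.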
Slide 14
Experimental setup
- SPEC95: go, li, vortex; SPEC2K: gcc, vortex; deltablue, sis, burg
- Compilation: DEC Alpha 21164, DEC C (full optimizations)
- Profiling analysis: ATOM
- Fast-forwarded, then ran for 500 million instructions
Slide 15
Error Calculation
- For each interval, compare the candidates seen by the HW profiler and by a perfect profiler
  - False Positive
  - False Negative
  - Neutral Positive
  - Neutral Negative
- Total error rate for an interval
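The comparison against a perfect profiler can be illustrated with a small classifier. The slide lists the four categories without defining them, so the definitions used here are an assumed reading: a false positive is reported by HW but is not a true candidate, a false negative is a true candidate that HW missed, and the neutral cases are true candidates that HW over-counted (positive) or under-counted (negative). The function name and the sample counts are hypothetical, and the total error rate formula is not reproduced because the slide does not give it.

```python
def classify(hw, perfect, threshold):
    """hw: tuple -> count reported by the HW profiler for one interval.
    perfect: tuple -> exact count for the same interval."""
    result = {"false_positive": [], "false_negative": [],
              "neutral_positive": [], "neutral_negative": []}
    for tup in sorted(set(hw) | set(perfect)):
        h, p = hw.get(tup, 0), perfect.get(tup, 0)
        hw_candidate, true_candidate = h >= threshold, p >= threshold
        if hw_candidate and not true_candidate:
            result["false_positive"].append(tup)
        elif true_candidate and not hw_candidate:
            result["false_negative"].append(tup)
        elif hw_candidate and true_candidate:
            if h > p:
                result["neutral_positive"].append(tup)
            elif h < p:
                result["neutral_negative"].append(tup)
    return result


perfect = {(1, 1): 30, (2, 2): 12, (3, 3): 4, (4, 4): 15}
hw = {(1, 1): 33, (3, 3): 11, (4, 4): 13}
errors = classify(hw, perfect, threshold=10)
```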
Slide 16
Experimental Results
- Accuracy of HW profiling depends on:
  - the number of unique (distinct) tuples in an interval
  - the number of unique tuples that cross the threshold
- Analysis of candidate tuples
Number of distinct tuples seen in an interval on average
Slide 17
Number of unique candidate tuples in an interval on
average
Slide 18
Percentage of variation of candidates from one interval to the
next
Slide 19
Error rates
Single hash table with retaining/resetting: results across a set of benchmarks
Slide 20
Multiple-hash Profiler
- Independent hash functions (one for each table)
- if (no entry in acc. table)
    hash to each table
    update each counter
- if (all counters for a particular tuple in the hash tables reach the candidate threshold)
    add an entry to the acc. table
    reset the counters in the hash tables (immediately or at the end of the interval)
- Conservative update
  - update just the smallest counter
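The multi-hash update, with and without conservative update, can be modeled as below. This is a sketch, not the paper's hardware: the class name and parameters are hypothetical, Python's `hash` salted with the table number stands in for the independent hash functions, ties for the smallest counter are all incremented, a promoted entry is assumed to start from the smallest counter value, and counters are not reset on promotion (the "no resetting" variant).

```python
class MultiHashProfiler:
    def __init__(self, num_tables=4, total_entries=2048,
                 threshold=100, conservative=True):
        self.size = total_entries // num_tables
        self.tables = [[0] * self.size for _ in range(num_tables)]
        self.threshold = threshold
        self.conservative = conservative
        self.acc = {}  # tuple -> count; models the accumulator table

    def indices(self, tup):
        # Placeholder for independent hash functions (assumption).
        return [hash((t, tup)) % self.size for t in range(len(self.tables))]

    def observe(self, tup):
        if tup in self.acc:
            self.acc[tup] += 1
            return
        idx = self.indices(tup)
        counts = [tbl[i] for tbl, i in zip(self.tables, idx)]
        low = min(counts)
        for tbl, i, c in zip(self.tables, idx, counts):
            # Conservative update touches only the smallest counter(s).
            if not self.conservative or c == low:
                tbl[i] += 1
        # After the update the smallest counter equals low + 1, so all of the
        # tuple's counters have reached the threshold exactly when it has.
        if low + 1 >= self.threshold:
            self.acc[tup] = low + 1


tup = (0x400, 7)  # hypothetical <pc, value> tuple
prof = MultiHashProfiler(num_tables=4, total_entries=64, threshold=3)
idx = prof.indices(tup)
prof.tables[0][idx[0]] = 5   # pretend another tuple aliased into table 0
prof.observe(tup)
after_one = [tbl[i] for tbl, i in zip(prof.tables, idx)]
prof.observe(tup)
prof.observe(tup)
```

After the first observation the inflated counter in table 0 stays at 5 while the other three move to 1, so aliasing in one table does not push the tuple toward promotion early; the tuple is promoted only on its third real occurrence.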
Slide 21
Multi-hash profiler for an interval of 10,000, 1% candidate threshold, and a total of 2K hash-table entries
Multi-hash profiler for an interval of 1 million, 0.1% candidate threshold, and a total of 2K hash-table entries
Slide 22
Varying the number of hash tables for the best multi-hash profiler, C1, R0 (w/ conservative update and w/o resetting) (10,000, 1% - L; 1 million, 0.1% - R)
Variation in the error across different intervals (BSH w/ resetting - L; multi-hash w/ conservative update and no resetting, 4 hash tables - R)
Slide 23
Summary
- Profiling architecture
  - Efficiently filters out important data
  - Efficient in terms of HW cost (6 KB + (1 KB or 10 KB)) and overhead (no performance overhead)