MICROARCHITECTURE-INDEPENDENT WORKLOAD CHARACTERIZATION
FOR COMPUTER DESIGNERS, UNDERSTANDING THE CHARACTERISTICS OF WORKLOADS
RUNNING ON CURRENT AND FUTURE COMPUTER SYSTEMS IS OF UTMOST IMPORTANCE
DURING MICROPROCESSOR DESIGN. A MICROARCHITECTURE-INDEPENDENT METHOD
ENSURES AN ACCURATE CHARACTERIZATION OF INHERENT PROGRAM BEHAVIOR AND
AVOIDS THE WEAKNESSES OF MICROARCHITECTURE-DEPENDENT METRICS.
The workloads that run on our computer systems are always evolving. Software companies continually come up with new applications, many triggered by the increasing computational power available. It is important that computer designers understand the characteristics of these emerging workloads to optimize systems for their target workloads. Moreover, the need for a solid workload characterization methodology is increasing with the shift to chip multiprocessors—especially heterogeneous CMPs with various cores specialized for particular types of workloads.

Computer architects and performance analysts are well aware of the workload drift phenomenon, and to address it, they typically collect benchmarks to represent emerging workloads. Examples of recently introduced benchmark suites covering emerging workloads are MediaBench for multimedia workloads, MiBench and EEMBC (Embedded Microprocessor Benchmark Consortium, http://www.eembc.org) for embedded workloads, BioMetricsWorkload for biometrics workloads, and BioInfoMark and BioPerf for bioinformatics workloads.1–5 A key question is how different these workloads are from existing, well-known benchmark suites. Answering this question is important for two reasons. First, it provides insight into whether next-generation microprocessors should be designed differently from today's machines to accommodate the emerging workloads. Second, if the new workload domain is not significantly different from existing benchmark suites, there is no need to include the new benchmarks in the design process—simulating the additional benchmarks would only add to the simulation time without providing additional information.

To find out how different the emerging workloads are from existing benchmark suites, computer designers usually compare the characteristics of the emerging workloads with the characteristics of well-known benchmark suites. A typical approach is to characterize the emerging workload in terms of microarchitecture-dependent metrics. For example, most workload characterization studies run benchmarks representing the emerging workload on a given microprocessor while measuring program characteristics with hardware performance counters. Other studies use simulation to derive similar results. The program characteristics typically measured include instruction mix and microarchitecture-dependent characteristics such as instructions per cycle (IPC), cache miss rates, branch misprediction rates,
Kenneth Hoste
Lieven Eeckhout
Ghent University
0272-1732/07/$20.00 © 2007 IEEE Published by the IEEE Computer Society.
and translation look-aside buffer (TLB) miss rates. These studies conclude either that two workloads are dissimilar if their hardware performance counter characteristics are dissimilar, or that two workloads are similar if their hardware performance counter characteristics are similar.

A major pitfall of a microarchitecture-dependent workload characterization methodology is that it can hide underlying, inherent program behavior. To avoid this pitfall, we advocate characterizing workloads in a microarchitecture-independent manner to capture their true inherent program behavior. This article presents a microarchitecture-independent workload characterization methodology and demonstrates its usefulness in characterizing benchmark suites for emerging workloads.
Pitfall of microarchitecture-dependent workload characterization
Before presenting our methodology, we present a case study that illustrates the problem with microarchitecture-dependent workload characterization. We considered 118 benchmarks from six benchmark suites: SPECcpu2000, MediaBench, MiBench, BioInfoMark, BioMetricsWorkload, and CommBench. Throughout the article, we refer to a benchmark-input pair as a benchmark. For each benchmark, we collected microarchitecture-dependent characteristics with hardware performance counters using the Digital Continuous Profiling Infrastructure (DCPI) tool on two hardware platforms: an Alpha 21164A (EV56) and an Alpha 21264A (EV67) machine. The characteristics we collected are cycles per instruction (CPI) on both the EV56 and the EV67; and the L1 D-cache, L1 I-cache, and L2 cache miss rates on the EV56.

We also measured a set of microarchitecture-independent characteristics in six categories. Table 1 summarizes these characteristics; we present a more detailed description elsewhere.6 These microarchitecture-independent characteristics can be collected through binary instrumentation using tools such as ATOM (which we use), Pin, Valgrind, and DynamoRIO. These characteristics are microarchitecture-independent but not independent of the instruction set architecture (ISA) and the compiler. In previous work, however, we observed that these characteristics provide a fairly accurate characterization picture, even across platforms.7
Table 1. Microarchitecture-independent characteristics collected to characterize workload behavior.
Of the 118 benchmarks in our data set, we compare two for our case study: SPECcpu2000's gzip-graphic and BioInfoMark's fasta. Many pairs of benchmarks could serve as a case study, but we limit ourselves to a single example. Table 2 shows microarchitecture-dependent and microarchitecture-independent characteristics for the two benchmarks, as well as the maximum value observed across all 118 benchmarks (to put the values for gzip and fasta in perspective). The microarchitecture-dependent characteristics, CPI and cache miss rates, are fairly similar for gzip and fasta (especially compared with the maximum values observed across the entire data set).

The microarchitecture-independent characteristics, on the other hand, are quite different. The fasta benchmark's data working set is one order of magnitude smaller than that of gzip (although fasta's dynamic instruction count is about 6.7 times larger). Memory access patterns are also very different for the two benchmarks. For example, the probability of the same static load addressing the same memory location at consecutive executions—a local load stride equal to zero—is more than twice as large for gzip as for fasta. The probability of the difference in memory addresses of consecutive memory writes—the global store stride—being smaller than 64 is more than 2.5 times as large for fasta as for gzip.

We conclude that although microarchitecture-dependent workload behavior is fairly similar, inherent program behavior can be very different, which can be misleading for microprocessor design. Although two workloads behave the same on one microarchitecture, they might exhibit different behavior and performance on other microarchitectures. We therefore advocate using microarchitecture-independent program characteristics to characterize workloads. Similar microarchitecture-independent behavior implies similar microarchitecture-dependent behavior; dissimilar microarchitecture-dependent behavior implies dissimilar inherent behavior.
Methodology

Although collecting microarchitecture-independent characteristics is straightforward with binary instrumentation tools, analyzing the collected data is far from trivial. For example, our data set is a 118 × 47 data matrix. That is, there are 118 benchmarks, for which we measure p = 47 characteristics. Obviously, gaining insight into a large data set is difficult without an appropriate data analysis technique. Here, we discuss two possible techniques. One is a statistical
Table 2. Case study comparing microarchitecture-dependent and -independent characteristics for benchmarks gzip-graphic and fasta.

Characteristic                                  gzip-graphic      fasta   Maximum (all 118)
Microarchitecture-dependent
  CPI on Alpha 21164                                    1.01       0.92               14.04
  CPI on Alpha 21264                                    0.63       0.66                5.22
  L1 D-cache misses per instruction (%)                 1.61       1.90               22.58
  L1 I-cache misses per instruction (%)                 0.15       0.18                6.44
  L2 cache misses per instruction (%)                   0.78       0.25               17.59
Microarchitecture-independent
  Data working set (32-byte blocks)                3,857,693    438,726          31,709,065
  Data working set (4-Kbyte pages)                    46,199      4,058             248,108
  Instruction working set (32-byte blocks)             1,394      3,801              24,377
  Instruction working set (4-Kbyte pages)                 33         79                 341
  Probability of local load stride = 0                  0.67       0.30                0.91
  Probability of local store stride = 0                 0.64       0.05                0.99
  Probability of global load stride ≤ 64                0.26       0.18                0.86
  Probability of global store stride ≤ 64               0.35       0.93                0.99
technique called principal components analysis (PCA). The other is a machine-learning algorithm called a genetic algorithm (GA). The goal of both approaches is to make the data set more understandable by reducing its dimensionality.
Principal components analysis

PCA has two main properties: It reduces the data set's dimensionality, and it removes correlation from the data set.9 Both features are important to increasing the data set's understandability. First, analyzing a lower-dimensional space is easier than analyzing a higher-dimensional space. Second, analyzing correlated data results in a distorted view—a distance measure in a correlated space places too much weight on correlated variables. For example, suppose that two correlated program characteristics are a consequence of the same underlying program characteristic. Then consider two benchmarks that show different behavior in terms of this underlying characteristic. Measuring this difference in terms of the correlated program characteristics will magnify it. Removing the correlation from the data set will give equal weight to all underlying program characteristics.
Before applying PCA, we first normalize the data set. We do this in two steps:
1. computing the mean x̄ and standard deviation s per microarchitecture-independent characteristic Xi, 1 ≤ i ≤ p, across all benchmarks, and

2. subtracting the mean and dividing by the standard deviation: Yi = (Xi − x̄)/s.
The result is that the transformed characteristics Yi, 1 ≤ i ≤ p, have a zero mean and a unit standard deviation. The goal of the normalization is to put all characteristics on a common scale.
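The two normalization steps can be sketched in a few lines. This is an illustrative NumPy sketch; the function name and toy data are ours, not from the article:

```python
import numpy as np

def normalize(X):
    """Z-score normalize each characteristic (column) of the
    benchmarks x characteristics matrix X: subtract the per-column
    mean and divide by the per-column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy example: 4 benchmarks, 2 characteristics on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])
Y = normalize(X)  # each column now has zero mean and unit std deviation
```

After this transformation, a working-set size measured in millions of blocks and a stride probability in [0, 1] contribute on the same scale to any distance computed between benchmarks.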
The input to PCA is a matrix in which the rows are the benchmarks and the columns are the normalized microarchitecture-independent characteristics Yi. PCA computes new variables, called principal components, which are linear combinations of the microarchitecture-independent characteristics, such that all principal components are uncorrelated. In other words, PCA transforms the p normalized microarchitecture-independent characteristics Y1, Y2, …, Yp into p principal components Z1, Z2, …, Zp with Zi = Σ(j=1..p) aij Yj. This transformation has two main properties. First, the first principal component exhibits the largest variance, followed by the second, followed by the third, and so on. That is, Var[Z1] ≥ Var[Z2] ≥ … ≥ Var[Zp]. Intuitively, this means that Z1 contains the most information, and Zp the least. Second, the dimensions along which the principal components are identified are orthogonal to each other. That is, the covariance between principal components is zero: Cov[Zi, Zj] = 0 for all i ≠ j. This means there is no information overlap between the principal components.
By removing the principal components with the lowest variance, we can reduce the data set's dimensionality while controlling the amount of information lost. Determining the number of principal components to retain is also important. Too few principal components won't capture important trends in the data set, and too many can lead to a curse-of-dimensionality problem. To measure the fraction of information retained in this q-dimensional space, we use the amount of variance

  (Σ(i=1..q) Var[Zi]) / (Σ(i=1..p) Var[Zi])

accounted for by these q principal components. For example, we can use criteria for data reduction such as "the retained principal components should explain 70 or 80 percent of the total variance." For our data set, we retain eight principal components, which explain 78 percent of the original data set's total variance.
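The variance-based retention criterion can be sketched as follows. This is a NumPy sketch assuming PCA via eigendecomposition of the covariance matrix; the function name and default threshold are ours:

```python
import numpy as np

def pca_retain(Y, target=0.78):
    """PCA on the normalized data Y (benchmarks x characteristics) via
    eigendecomposition of the covariance matrix. Retains the smallest
    number q of principal components whose cumulative fraction of the
    total variance reaches `target`, and returns their scores Z."""
    cov = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    frac = np.cumsum(eigvals) / eigvals.sum()
    q = int(np.searchsorted(frac, target)) + 1
    Z = Y @ eigvecs[:, :q]                        # principal component scores
    return Z, q, frac[q - 1]
```

Because the covariance matrix is symmetric, its eigenvectors are orthogonal, which is what guarantees the retained component scores are mutually uncorrelated.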
After PCA, it is important to normalize the principal components to give equal weight to all the retained principal components. Our intuition is that by doing so, we give equal weight to the underlying program behaviors extracted by PCA.
By examining the most important q principal components Zi = Σ(j=1..p) aij Yj, i = 1, …, q, we can interpret these principal components in terms of the original microarchitecture-independent characteristics. A coefficient aij that is close to +1 or −1 implies a strong impact of the original characteristic Xj on principal component Zi. A coefficient aij close to 0, on the other hand, implies no impact.
Genetic algorithm

Although PCA reduces the data set's dimensionality effectively, the fact that each principal component is a linear combination of the original workload characteristics complicates the understandability of the lower-dimensional workload space. Our second workload analysis methodology, which uses a genetic algorithm (GA), also reduces the data set's dimensionality, but the retained dimensions are easier to understand because each dimension is a single workload characteristic.
A GA is an evolutionary optimization method that starts from a set of populations of random solutions. The algorithm computes a fitness score for each solution and selects the solutions with the highest fitness scores to construct the next generation of solutions. It constructs the next generation by applying mutation, crossover, and migration to the selected solutions. Mutation randomly changes a single solution, crossover generates new solutions by mixing existing solutions, and migration allows solutions to switch populations. The GA repeats this process—that is, it constructs new generations—until the fitness score shows no further improvement.
A solution is a series of N zeros and ones, with N being the number of microarchitecture-independent characteristics. A one selects a program characteristic, and a zero excludes a program characteristic.
The GA's fitness score evaluates the correlation coefficient of the distances between benchmark pairs in the original data set versus the distances between benchmark pairs in the reduced data set, which includes only characteristics with a one assigned. We use PCA to compute the distance in the original data set as well as in the reduced data set. That is, we first apply PCA on both data sets, retain the principal components with a variance greater than one, normalize the principal components, and finally compute the Euclidean distances between benchmarks in terms of their normalized principal components. We take this additional PCA step to discount the correlation in the data set from the distance measure while accounting for the most important underlying program characteristics. The GA's end result is a limited number of program characteristics that accurately characterize a program's behavior. In selecting eight program characteristics (to yield the same dimensionality that we obtained with PCA), the GA achieved a 0.86 correlation coefficient between the distances in the original data set compared with the distances in the reduced data set.
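A GA of this shape can be sketched as follows. This is a deliberately simplified toy, not the article's implementation: it omits the inner PCA step (distances are computed directly on the normalized characteristics), uses a single population (so no migration), and all function names and parameters are ours:

```python
import numpy as np
from itertools import combinations

def pairwise_dists(Y):
    """Condensed vector of Euclidean distances between all benchmark pairs."""
    return np.array([np.linalg.norm(Y[i] - Y[j])
                     for i, j in combinations(range(len(Y)), 2)])

def fitness(mask, Y, ref):
    """Correlation between pairwise benchmark distances in the full data
    set (ref) and in the subset of characteristics the 0/1 mask selects.
    (The article additionally applies PCA before measuring distances;
    that step is omitted here for brevity.)"""
    if mask.sum() == 0:
        return -1.0
    return np.corrcoef(ref, pairwise_dists(Y[:, mask.astype(bool)]))[0, 1]

def ga_select(Y, k=8, pop=30, gens=40, seed=0):
    """Toy GA: evolve 0/1 masks selecting exactly k characteristics,
    with truncation selection, uniform crossover, and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    ref = pairwise_dists(Y)

    def repair(m):
        # Enforce exactly k selected characteristics.
        ones = np.flatnonzero(m)
        if len(ones) > k:
            m[rng.choice(ones, len(ones) - k, replace=False)] = 0
        while m.sum() < k:
            m[rng.choice(np.flatnonzero(m == 0))] = 1
        return m

    def random_mask():
        m = np.zeros(p, dtype=int)
        m[rng.choice(p, size=k, replace=False)] = 1
        return m

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        scores = [fitness(m, Y, ref) for m in population]
        order = np.argsort(scores)[::-1]
        elite = [population[i] for i in order[:pop // 2]]
        children = []
        while len(elite) + len(children) < pop:
            a, b = rng.choice(len(elite), size=2, replace=False)
            child = np.where(rng.random(p) < 0.5, elite[a], elite[b])
            child = np.where(rng.random(p) < 0.05, 1 - child, child)  # mutation
            children.append(repair(child.astype(int)))
        population = elite + children
    scores = [fitness(m, Y, ref) for m in population]
    best = population[int(np.argmax(scores))]
    return best, float(max(scores))
```

The returned mask directly names which characteristics survive, which is what makes the GA's reduced space easier to interpret than PCA's linear combinations.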
Evaluation

To gain confidence about the reduced data set's validity with respect to the original data set, we use the reduced data set to compose a subset of benchmarks. Previous work proposed an approach for composing representative benchmark suites based on inherent program behavior.10–13 The goal of benchmark suite composition is to select a small set of benchmarks representative of a larger set, so that all major program behaviors are represented in the composed benchmark suite. This benchmark suite composition method consists of three steps. The first step measures a number of microarchitecture-independent characteristics for all benchmarks. The second step reduces the data set's dimensionality. We can do this with PCA, or we can use the GA to select a limited number of program characteristics. The third step clusters the various benchmarks according to their inherent behavior. The goal of cluster analysis is to group benchmarks with similar program behavior in a single cluster and benchmarks with dissimilar program behavior in different clusters.9 The benchmark suite subset composer then chooses a representative from each cluster; this is the benchmark closest to the cluster's centroid. The representative's weight is the number of benchmarks it represents in the cluster.
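The third step (clustering plus representative selection) can be sketched as follows. Plain k-means is used here as a stand-in; the article does not commit to a specific clustering algorithm in this passage, and the function name and parameters are ours:

```python
import numpy as np

def select_representatives(Z, k, seed=0):
    """Cluster the benchmarks (rows of the reduced data set Z) into k
    clusters with a simple k-means, then pick the benchmark closest to
    each cluster centroid as that cluster's representative; its weight
    is the number of benchmarks in the cluster."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    centroids = Z[rng.choice(n, size=k, replace=False)]
    for _ in range(100):
        # Assign every benchmark to its nearest centroid.
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([Z[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    reps, weights = [], []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if len(members) == 0:
            continue
        closest = members[np.argmin(
            np.linalg.norm(Z[members] - centroids[c], axis=1))]
        reps.append(int(closest))
        weights.append(int(len(members)))
    return reps, weights
```

The weights matter later: when a metric such as CPI is estimated from the subset, each representative's value is weighted by the number of benchmarks it stands for.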
This methodology lets us find subsets of benchmarks that are representative across different microarchitectures.11,13 Figure 1 illustrates this by showing the CPI prediction error for the selected subset with respect to all benchmarks for the Alpha 21264 machine. The CPI prediction error is the relative error between the average CPI of all benchmarks and the CPI computed as a weighted average over the CPI values of the representatives. This graph shows curves for both the PCA and GA approaches. In both cases, eight dimensions are retained. CPI prediction error typically decreases with increasing subset sizes. Once beyond a subset size of 50 of the 118 benchmarks, the maximum CPI prediction error is consistently below 5 percent.
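The error metric just described is a one-liner; the sketch below uses hypothetical numbers (the function name and data are ours):

```python
import numpy as np

def cpi_prediction_error(cpi, reps, weights):
    """Relative error between the average CPI over all benchmarks and
    the weighted average CPI over the subset representatives (each
    weight is the number of benchmarks a representative stands for)."""
    true_avg = np.mean(cpi)
    est_avg = np.average(np.asarray(cpi)[reps], weights=weights)
    return abs(est_avg - true_avg) / true_avg

# Hypothetical numbers: 4 benchmarks, representatives 1 and 3,
# each standing for 2 benchmarks.
cpi = np.array([1.0, 2.0, 3.0, 4.0])
err = cpi_prediction_error(cpi, [1, 3], [2, 2])  # |3.0 - 2.5| / 2.5 = 0.2
```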
Comparing benchmark suites

Because the workload characterization obtained with the GA is easier to understand, we use that methodology to characterize the 118 benchmarks. The following are the eight microarchitecture-independent characteristics retained by the GA:
- probability of a register dependence distance ≤ 16
- branch predictability of a per-address, global history table (PAg) prediction-by-partial-matching (PPM) predictor
- percentage of multiply instructions
- data stream working-set size at the 32-byte block level
- probability of a local load stride = 0
- probability of a global load stride ≤ 8
- probability of a local store stride ≤ 8
- probability of a local store stride ≤ 4,096
These key characteristics relate to register dependence distance distribution, branch predictability, instruction mix, data stream working-set size, and data working-set access patterns.
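To make one of these characteristics concrete, the probability of a local load stride = 0 can be estimated from a load trace as sketched below. The trace format (static-load PC, data address) is a simplification of what a binary instrumentation tool would emit, and the function name and toy trace are ours:

```python
def local_stride_zero_prob(trace):
    """Estimate the probability that a static load addresses the same
    memory location on consecutive executions (local load stride = 0),
    from a trace of (static_load_pc, data_address) pairs."""
    last = {}          # last data address seen per static load
    zero = total = 0
    for pc, addr in trace:
        if pc in last:
            total += 1
            if addr == last[pc]:
                zero += 1
        last[pc] = addr
    return zero / total if total else 0.0

# Hypothetical trace: the load at PC 0x10 always hits address 100
# (stride 0); the load at PC 0x20 strides by 8 bytes.
trace = [(0x10, 100), (0x20, 200), (0x10, 100), (0x20, 208), (0x10, 100)]
p = local_stride_zero_prob(trace)  # 2 of 3 consecutive pairs have stride 0
```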
We use Kiviat plots to visualize a benchmark's inherent behavior in terms of the eight key microarchitecture-independent characteristics. Figure 2 shows Kiviat plots for the 118 benchmarks. Each axis represents a microarchitecture-independent characteristic. The various rings within the plot represent the mean value minus one standard deviation, the mean value, and the mean value plus one standard deviation along each dimension. The center point and the outer ring represent values outside the range defined by the mean minus and plus the standard deviation. We characterize the prominent program behaviors by connecting their key characteristics to form an area shown in dark gray, which visualizes a benchmark's inherent behavior. By comparing the dark areas across the various benchmarks, we can see how different their behaviors are. We also cluster the benchmarks into 50 clusters, each bounded by a box, with the cluster representative indicated by an asterisk.
These plots provide valuable insight. For example, some benchmarks seem to be isolated. These benchmarks exhibit program characteristics very dissimilar to any of the other benchmarks, so they appear as singleton clusters. Examples are blast, mcf, and swim. We can derive the reason for their particular program behavior from the plots. For example, the reason for blast's being isolated is its large working set, and mcf exhibits long local store strides (that is, large address differences between consecutive memory accesses by the same static store instruction).

Another observation is that some benchmarks are susceptible to their input, whereas
Figure 1. CPI prediction error as a function of subset size for the PCA and GA data reduction approaches.
others are not. For example, the input to gcc and perl causes quite different behavioral characteristics; for vortex, gzip, and csu, on the other hand, the input doesn't seem to affect the behavioral characteristics as greatly. Yet another interesting observation is that many benchmarks from recently introduced benchmark suites exhibit dissimilar inherent behavior compared to SPECcpu2000 benchmarks. More particularly, about 40 percent of the clusters don't contain any of the SPECcpu benchmarks. This suggests that
Figure 2. Kiviat diagrams representing the eight key microarchitecture-independent characteristics of the 118 benchmarks.
Darkest areas represent inherent behavior patterns. Boxes represent clusters, and asterisks indicate cluster representatives.
SPECcpu doesn't adequately cover the entire workload space, and that a more complete benchmark suite should incorporate additional benchmarks for the microprocessor design cycle. The methodology presented here lets designers select a diverse and representative benchmark set.
Several potential avenues toward an even more effective workload characterization methodology are worth addressing. First, speeding up the time-consuming profiling step of collecting microarchitecture-independent characteristics would greatly improve our methodology's usability. Possible ways to accomplish this include sampling, hardware acceleration, and multithreaded profiling. Second, as we have demonstrated, a microarchitecture-independent workload characterization is more accurate and informative than a microarchitecture-dependent characterization. Ideally, however, the characterization should also be independent of the ISA and the compiler, to capture a program's true inherent behavior. Making the characterization both ISA and compiler independent is thus another interesting path to explore. Third, an important question is what program characteristics to include in the analysis. It is important that the characterization includes a sufficiently diverse set of characteristics. For example, extending the current set of characteristics to characterize multithreaded workloads in a microarchitecture-independent manner is an area to focus on. Also, as architectures evolve, it is important to revisit these program characteristics. Finally, other statistical, machine-learning, or data-mining techniques might be useful for analyzing workload behavior and extracting key program behavior characteristics. MICRO
Acknowledgments

We thank the anonymous reviewers for their valuable comments. Kenneth Hoste receives support through a doctoral student fellowship from the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders). Lieven Eeckhout receives support through a postdoctoral fellowship of the Fund for Scientific Research, Flanders, Belgium (FWO-Vlaanderen). We also thank Kjell Andresen from the University of Oslo for offering Alpha machine compute cycles.
Kenneth Hoste is a doctoral student in the Electronics and Information Systems Department of Ghent University, Belgium. His research interests include computer architecture in general and workload characterization in particular. Hoste has an MS in computer science from Ghent University.
Lieven Eeckhout is an assistant professor in the Electronics and Information Systems Department of Ghent University, Belgium. His research interests include computer architecture, virtual machines, performance analysis and modeling, and workload characterization. Eeckhout has a PhD in computer science and engineering from Ghent University. He is a member of the IEEE.
Direct questions and comments about this article to Kenneth Hoste and Lieven Eeckhout, ELIS, Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium; [email protected] and [email protected].