MICROARCHITECTURE-INDEPENDENT WORKLOAD CHARACTERIZATION
FOR COMPUTER DESIGNERS, UNDERSTANDING THE CHARACTERISTICS OF WORKLOADS
RUNNING ON CURRENT AND FUTURE COMPUTER SYSTEMS IS OF UTMOST IMPORTANCE
DURING MICROPROCESSOR DESIGN. A MICROARCHITECTURE-INDEPENDENT METHOD
ENSURES AN ACCURATE CHARACTERIZATION OF INHERENT PROGRAM BEHAVIOR AND
AVOIDS THE WEAKNESSES OF MICROARCHITECTURE-DEPENDENT METRICS.
The workloads that run on our computer systems are always evolving. Software companies continually come up with new applications, many triggered by the increasing computational power available. It is important that computer designers understand the characteristics of these emerging workloads to optimize systems for their target workloads. Moreover, the need for a solid workload characterization methodology is increasing with the shift to chip multiprocessors—especially heterogeneous CMPs with various cores specialized for particular types of workloads.

Computer architects and performance analysts are well aware of the workload drift phenomenon, and to address it, they typically collect benchmarks to represent emerging workloads. Examples of recently introduced benchmark suites covering emerging workloads are MediaBench for multimedia workloads, MiBench and EEMBC (Embedded Microprocessor Benchmark Consortium, http://www.eembc.org) for embedded workloads, BioMetricsWorkload for biometrics workloads, and BioInfoMark and BioPerf for bioinformatics workloads.1–5 A key question is how different these workloads are from existing, well-known benchmark suites. Answering this question is important for two reasons. First, it provides insight into whether next-generation microprocessors should be designed differently from today's machines to accommodate the emerging workloads. Second, if the new workload domain is not significantly different from existing benchmark suites, there is no need to include the new benchmarks in the design process—simulating the additional benchmarks would only add to the simulation time without providing additional information.

To find out how different the emerging workloads are from existing benchmark suites, computer designers usually compare the characteristics of the emerging workloads with the characteristics of well-known benchmark suites. A typical approach is to characterize the emerging workload in terms of microarchitecture-dependent metrics. For example, most workload characterization studies run benchmarks representing the emerging workload on a given microprocessor while measuring program characteristics with hardware performance counters. Other studies use simulation to derive similar results. The program characteristics typically measured include instruction mix and microarchitecture-dependent characteristics such as instructions per cycle (IPC), cache miss rates, branch misprediction rates,
Kenneth Hoste
Lieven Eeckhout
Ghent University
0272-1732/07/$20.00 © 2007 IEEE Published by the IEEE Computer Society.
and translation look-aside buffer (TLB) miss rates. These studies conclude either that two workloads are dissimilar if their hardware performance counter characteristics are dissimilar, or that two workloads are similar if their hardware performance counter characteristics are similar.

A major pitfall of a microarchitecture-dependent workload characterization methodology is that it can hide underlying, inherent program behavior. To avoid this pitfall, we advocate characterizing workloads in a microarchitecture-independent manner to capture their true inherent program behavior. This article presents a microarchitecture-independent workload characterization methodology and demonstrates its usefulness in characterizing benchmark suites for emerging workloads.
Pitfall of microarchitecture-dependent workload characterization
Before presenting our methodology, we present a case study that illustrates the problem with microarchitecture-dependent workload characterization. We considered 118 benchmarks from six benchmark suites: SPECcpu2000, MediaBench, MiBench, BioInfoMark, BioMetricsWorkload, and CommBench. Throughout the article, we refer to a benchmark-input pair as a benchmark. For each benchmark, we collected microarchitecture-dependent characteristics with hardware performance counters using the Digital Continuous Profiling Infrastructure (DCPI) tool on two hardware platforms: an Alpha 21164A (EV56) and an Alpha 21264A (EV67) machine. The characteristics we collected are cycles per instruction (CPI) on both the EV56 and the EV67; and the L1 D-cache, L1 I-cache, and L2 cache miss rates on the EV56.

We also measured a set of microarchitecture-independent characteristics in six categories. Table 1 summarizes these characteristics; we present a more detailed description elsewhere.6 These microarchitecture-independent characteristics can be collected through binary instrumentation using tools such as ATOM (which we use), Pin, Valgrind, and DynamoRIO. These characteristics are microarchitecture-independent but not independent of the instruction set architecture (ISA) and the compiler. In previous work, however, we observed that these characteristics provide a fairly accurate characterization picture, even across platforms.7
Table 1. Microarchitecture-independent characteristics collected to characterize workload behavior.
Of the 118 benchmarks in our data set, we compare two for our case study: SPECcpu2000's gzip-graphic and BioInfoMark's fasta. Many pairs of benchmarks could serve as a case study, but we limit ourselves to a single example. Table 2 shows microarchitecture-dependent and microarchitecture-independent characteristics for the two benchmarks, as well as the maximum value observed across all 118 benchmarks (to put the values for gzip and fasta in perspective). The microarchitecture-dependent characteristics, CPI and cache miss rates, are fairly similar for gzip and fasta (especially compared with the maximum values observed across the entire data set).

The microarchitecture-independent characteristics, on the other hand, are quite different. The fasta benchmark's data working set is one order of magnitude smaller than that of gzip (although fasta's dynamic instruction count is about 6.7 times larger). Memory access patterns are also very different for the two benchmarks. For example, the probability of the same static load addressing the same memory location at consecutive executions—a local load stride equal to zero—is more than twice as large for gzip as for fasta. The probability of the difference in memory addresses of consecutive memory writes—the global store stride—being smaller than 64 is more than 2.5 times as large for fasta as for gzip.

We conclude that although microarchitecture-dependent workload behavior is fairly similar, inherent program behavior can be very different, which can be misleading for microprocessor design. Although two workloads behave the same on one microarchitecture, they might exhibit different behavior and performance on other microarchitectures. We therefore advocate using microarchitecture-independent program characteristics to characterize workloads. Similar microarchitecture-independent behavior implies similar microarchitecture-dependent behavior; dissimilar microarchitecture-dependent behavior implies dissimilar inherent behavior.
Methodology

Although collecting microarchitecture-independent characteristics is straightforward with binary instrumentation tools, analyzing the collected data is far from trivial. For example, our data set is a 118 × 47 data matrix. That is, there are 118 benchmarks, for which we measure p = 47 characteristics. Obviously, gaining insight into a large data set is difficult without an appropriate data analysis technique. Here, we discuss two possible techniques. One is a statistical
Table 2. Case study comparing microarchitecture-dependent and -independent characteristics for benchmarks gzip-graphic and fasta.

Characteristic                                  gzip-graphic      fasta   Maximum (all 118)
Microarchitecture-dependent
  CPI on Alpha 21164                                    1.01       0.92               14.04
  CPI on Alpha 21264                                    0.63       0.66                5.22
  L1 D-cache misses per instruction (%)                 1.61       1.90               22.58
  L1 I-cache misses per instruction (%)                 0.15       0.18                6.44
  L2 cache misses per instruction (%)                   0.78       0.25               17.59
Microarchitecture-independent
  Data working set (32-byte blocks)                3,857,693    438,726          31,709,065
  Data working set (4-Kbyte pages)                    46,199      4,058             248,108
  Instruction working set (32-byte blocks)             1,394      3,801              24,377
  Instruction working set (4-Kbyte pages)                 33         79                 341
  Probability of local load stride = 0                  0.67       0.30                0.91
  Probability of local store stride = 0                 0.64       0.05                0.99
  Probability of global load stride ≤ 64                0.26       0.18                0.86
  Probability of global store stride ≤ 64               0.35       0.93                0.99
technique called principal components analysis (PCA). The other is a machine-learning algorithm called a genetic algorithm (GA). The goal of both approaches is to make the data set more understandable by reducing its dimensionality.
Principal components analysis

PCA has two main properties: It reduces the data set's dimensionality, and it removes correlation from the data set.9 Both features are important to increasing the data set's understandability. First, analyzing a lower-dimensional space is easier than analyzing a higher-dimensional space. Second, analyzing correlated data results in a distorted view—a distance measure in a correlated space places too much weight on correlated variables. For example, suppose that two correlated program characteristics are a consequence of the same underlying program characteristic. Then consider two benchmarks that show different behavior in terms of this underlying characteristic. Measuring this difference in terms of the correlated program characteristics will magnify it. Removing the correlation from the data set will give equal weight to all underlying program characteristics.
Before applying PCA, we first normalize the data set. We do this in two steps:
1. computing the mean x̄ and standard deviation s per microarchitecture-independent characteristic Xi, 1 ≤ i ≤ p, across all benchmarks, and

2. subtracting the mean and dividing by the standard deviation: Yi = (Xi − x̄)/s.
The result is that the transformed characteristics Yi, 1 ≤ i ≤ p, have a zero mean and a unit standard deviation. The goal of the normalization is to put all characteristics on a common scale.
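The two normalization steps can be sketched in a few lines. This is an illustrative NumPy sketch; the function name and toy data are ours, not from the article:

```python
import numpy as np

def normalize(X):
    """Z-score normalize each characteristic (column) of the
    benchmarks x characteristics matrix X: subtract the per-column
    mean and divide by the per-column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy example: 4 benchmarks, 2 characteristics on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])
Y = normalize(X)  # each column now has zero mean and unit std deviation
```

After this transformation, a working-set size measured in millions of blocks and a stride probability in [0, 1] contribute on the same scale to any distance computed between benchmarks.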
The input to PCA is a matrix in which the rows are the benchmarks and the columns are the normalized microarchitecture-independent characteristics Yi. PCA computes new variables, called principal components, which are linear combinations of the microarchitecture-independent characteristics, such that all principal components are uncorrelated. In other words, PCA transforms the p normalized microarchitecture-independent characteristics Y1, Y2, …, Yp into p principal components Z1, Z2, …, Zp with Zi = Σ(j=1..p) aij Yj. This transformation has two main properties. First, the first principal component exhibits the largest variance, followed by the second, followed by the third, and so on. That is, Var[Z1] ≥ Var[Z2] ≥ … ≥ Var[Zp]. Intuitively, this means that Z1 contains the most information, and Zp the least. Second, the dimensions along which the principal components are identified are orthogonal to each other. That is, the covariance between principal components is zero: Cov[Zi, Zj] = 0 for all i ≠ j. This means there is no information overlap between the principal components.
By removing the principal components with the lowest variance, we can reduce the data set's dimensionality while controlling the amount of information lost. Determining the number of principal components to retain is also important. Too few principal components won't capture important trends in the data set, and too many can lead to a curse-of-dimensionality problem. To measure the fraction of information retained in this q-dimensional space, we use the amount of variance

  (Σ(i=1..q) Var[Zi]) / (Σ(i=1..p) Var[Zi])

accounted for by these q principal components. For example, we can use criteria for data reduction such as "the retained principal components should explain 70 or 80 percent of the total variance." For our data set, we retain eight principal components, which explain 78 percent of the original data set's total variance.
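The variance-based retention criterion can be sketched as follows. This is a NumPy sketch assuming PCA via eigendecomposition of the covariance matrix; the function name and default threshold are ours:

```python
import numpy as np

def pca_retain(Y, target=0.78):
    """PCA on the normalized data Y (benchmarks x characteristics) via
    eigendecomposition of the covariance matrix. Retains the smallest
    number q of principal components whose cumulative fraction of the
    total variance reaches `target`, and returns their scores Z."""
    cov = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    frac = np.cumsum(eigvals) / eigvals.sum()
    q = int(np.searchsorted(frac, target)) + 1
    Z = Y @ eigvecs[:, :q]                        # principal component scores
    return Z, q, frac[q - 1]
```

Because the covariance matrix is symmetric, its eigenvectors are orthogonal, which is what guarantees the retained component scores are mutually uncorrelated.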
After PCA, it is important to normalize the principal components to give equal weight to all the retained principal components. Our intuition is that by doing so, we give equal weight to the underlying program behaviors extracted by PCA.
By examining the most important q principal components Zi = Σ(j=1..p) aij Yj, i = 1, …, q, we can interpret these principal components in terms of the original microarchitecture-independent characteristics. A coefficient aij that is close to +1 or −1 implies a strong impact of the original characteristic Xj on principal component Zi. A coefficient aij close to 0, on the other hand, implies no impact.
Genetic algorithm

Although PCA reduces the data set's dimensionality effectively, the fact that each principal component is a linear combination of the original workload characteristics complicates the understandability of the lower-dimensional workload space. Our second workload analysis methodology, which uses a genetic algorithm (GA), also reduces the data set's dimensionality, but the retained dimensions are easier to understand because each dimension is a single workload characteristic.
A GA is an evolutionary optimization method that starts from a set of populations of random solutions. The algorithm computes a fitness score for each solution and selects the solutions with the highest fitness scores to construct the next generation of solutions. It constructs the next generation by applying mutation, crossover, and migration to the selected solutions. Mutation randomly changes a single solution, crossover generates new solutions by mixing existing solutions, and migration allows solutions to switch populations. The GA repeats this process—that is, it constructs new generations—until the fitness score shows no further improvement.
A solution is a series of N zeros and ones, with N being the number of microarchitecture-independent characteristics. A one selects a program characteristic, and a zero excludes a program characteristic.
The GA's fitness score evaluates the correlation coefficient of the distances between benchmark pairs in the original data set versus the distances between benchmark pairs in the reduced data set, which includes only characteristics with a one assigned. We use PCA to compute the distance in the original data set as well as in the reduced data set. That is, we first apply PCA on both data sets, retain the principal components with a variance greater than one, normalize the principal components, and finally compute the Euclidean distances between benchmarks in terms of their normalized principal components. We take this additional PCA step to discount the correlation in the data set from the distance measure while accounting for the most important underlying program characteristics. The GA's end result is a limited number of program characteristics that accurately characterize a program's behavior. In selecting eight program characteristics (to yield the same dimensionality that we obtained with PCA), the GA achieved a 0.86 correlation coefficient between the distances in the original data set compared with the distances in the reduced data set.
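A GA of this shape can be sketched as follows. This is a deliberately simplified toy, not the article's implementation: it omits the inner PCA step (distances are computed directly on the normalized characteristics), uses a single population (so no migration), and all function names and parameters are ours:

```python
import numpy as np
from itertools import combinations

def pairwise_dists(Y):
    """Condensed vector of Euclidean distances between all benchmark pairs."""
    return np.array([np.linalg.norm(Y[i] - Y[j])
                     for i, j in combinations(range(len(Y)), 2)])

def fitness(mask, Y, ref):
    """Correlation between pairwise benchmark distances in the full data
    set (ref) and in the subset of characteristics the 0/1 mask selects.
    (The article additionally applies PCA before measuring distances;
    that step is omitted here for brevity.)"""
    if mask.sum() == 0:
        return -1.0
    return np.corrcoef(ref, pairwise_dists(Y[:, mask.astype(bool)]))[0, 1]

def ga_select(Y, k=8, pop=30, gens=40, seed=0):
    """Toy GA: evolve 0/1 masks selecting exactly k characteristics,
    with truncation selection, uniform crossover, and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    ref = pairwise_dists(Y)

    def repair(m):
        # Enforce exactly k selected characteristics.
        ones = np.flatnonzero(m)
        if len(ones) > k:
            m[rng.choice(ones, len(ones) - k, replace=False)] = 0
        while m.sum() < k:
            m[rng.choice(np.flatnonzero(m == 0))] = 1
        return m

    def random_mask():
        m = np.zeros(p, dtype=int)
        m[rng.choice(p, size=k, replace=False)] = 1
        return m

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        scores = [fitness(m, Y, ref) for m in population]
        order = np.argsort(scores)[::-1]
        elite = [population[i] for i in order[:pop // 2]]
        children = []
        while len(elite) + len(children) < pop:
            a, b = rng.choice(len(elite), size=2, replace=False)
            child = np.where(rng.random(p) < 0.5, elite[a], elite[b])
            child = np.where(rng.random(p) < 0.05, 1 - child, child)  # mutation
            children.append(repair(child.astype(int)))
        population = elite + children
    scores = [fitness(m, Y, ref) for m in population]
    best = population[int(np.argmax(scores))]
    return best, float(max(scores))
```

The returned mask directly names which characteristics survive, which is what makes the GA's reduced space easier to interpret than PCA's linear combinations.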
Evaluation

To gain confidence about the reduced data set's validity with respect to the original data set, we use the reduced data set to compose a subset of benchmarks. Previous work proposed an approach for composing representative benchmark suites based on inherent program behavior.10–13 The goal of benchmark suite composition is to select a small set of benchmarks representative of a larger set, so that all major program behaviors are represented in the composed benchmark suite. This benchmark suite composition method consists of three steps. The first step measures a number of microarchitecture-independent characteristics for all benchmarks. The second step reduces the data set's dimensionality. We can do this with PCA, or we can use the GA to select a limited number of program characteristics. The third step clusters the various benchmarks according to their inherent behavior. The goal of cluster analysis is to group benchmarks with similar program behavior in a single cluster and benchmarks with dissimilar program behavior in different clusters.9 The benchmark suite subset composer then chooses a representative from each cluster; this is the benchmark closest to the cluster's centroid. The representative's weight is the number of benchmarks it represents in the cluster.
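The third step (clustering plus representative selection) can be sketched as follows. Plain k-means is used here as a stand-in; the article does not commit to a specific clustering algorithm in this passage, and the function name and parameters are ours:

```python
import numpy as np

def select_representatives(Z, k, seed=0):
    """Cluster the benchmarks (rows of the reduced data set Z) into k
    clusters with a simple k-means, then pick the benchmark closest to
    each cluster centroid as that cluster's representative; its weight
    is the number of benchmarks in the cluster."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    centroids = Z[rng.choice(n, size=k, replace=False)]
    for _ in range(100):
        # Assign every benchmark to its nearest centroid.
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([Z[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    reps, weights = [], []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if len(members) == 0:
            continue
        closest = members[np.argmin(
            np.linalg.norm(Z[members] - centroids[c], axis=1))]
        reps.append(int(closest))
        weights.append(int(len(members)))
    return reps, weights
```

The weights matter later: when a metric such as CPI is estimated from the subset, each representative's value is weighted by the number of benchmarks it stands for.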
This methodology lets us find subsets of benchmarks that are representative across different microarchitectures.11,13 Figure 1 illustrates this by showing the CPI prediction error for the selected subset with respect to all benchmarks for the Alpha 21264 machine. The CPI prediction error is the relative error between the average CPI of all benchmarks and the CPI computed as a weighted average over the CPI values of the representatives. This graph shows curves for both the PCA and GA approaches. In both cases, eight dimensions are retained. CPI prediction error typically decreases with increasing subset sizes. Once beyond a subset size of 50 of the 118 benchmarks, the maximum CPI prediction error is consistently below 5 percent.
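The error metric just described is a one-liner; the sketch below uses hypothetical numbers (the function name and data are ours):

```python
import numpy as np

def cpi_prediction_error(cpi, reps, weights):
    """Relative error between the average CPI over all benchmarks and
    the weighted average CPI over the subset representatives (each
    weight is the number of benchmarks a representative stands for)."""
    true_avg = np.mean(cpi)
    est_avg = np.average(np.asarray(cpi)[reps], weights=weights)
    return abs(est_avg - true_avg) / true_avg

# Hypothetical numbers: 4 benchmarks, representatives 1 and 3,
# each standing for 2 benchmarks.
cpi = np.array([1.0, 2.0, 3.0, 4.0])
err = cpi_prediction_error(cpi, [1, 3], [2, 2])  # |3.0 - 2.5| / 2.5 = 0.2
```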
Comparing benchmark suites

Because the workload characterization obtained with the GA is easier to understand, we use that methodology to characterize the 118 benchmarks. The following are the eight microarchitecture-independent characteristics retained by the GA:
- probability of a register dependence distance ≤ 16
- branch predictability of a per-address, global history table (PAg) prediction-by-partial-matching (PPM) predictor
- percentage of multiply instructions
- data stream working-set size at the 32-byte block level
- probability of a local load stride = 0
- probability of a global load stride ≤ 8
- probability of a local store stride ≤ 8
- probability of a local store stride ≤ 4,096
These key characteristics relate to register dependence distance distribution, branch predictability, instruction mix, data stream working-set size, and data working-set access patterns.
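To make one of these characteristics concrete, the probability of a local load stride = 0 can be estimated from a load trace as sketched below. The trace format (static-load PC, data address) is a simplification of what a binary instrumentation tool would emit, and the function name and toy trace are ours:

```python
def local_stride_zero_prob(trace):
    """Estimate the probability that a static load addresses the same
    memory location on consecutive executions (local load stride = 0),
    from a trace of (static_load_pc, data_address) pairs."""
    last = {}          # last data address seen per static load
    zero = total = 0
    for pc, addr in trace:
        if pc in last:
            total += 1
            if addr == last[pc]:
                zero += 1
        last[pc] = addr
    return zero / total if total else 0.0

# Hypothetical trace: the load at PC 0x10 always hits address 100
# (stride 0); the load at PC 0x20 strides by 8 bytes.
trace = [(0x10, 100), (0x20, 200), (0x10, 100), (0x20, 208), (0x10, 100)]
p = local_stride_zero_prob(trace)  # 2 of 3 consecutive pairs have stride 0
```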
We use Kiviat plots to visualize a benchmark's inherent behavior in terms of the eight key microarchitecture-independent characteristics. Figure 2 shows Kiviat plots for the 118 benchmarks. Each axis represents a microarchitecture-independent characteristic. The various rings within the plot represent the mean value minus one standard deviation, the mean value, and the mean value plus one standard deviation along each dimension. The center point and the outer ring represent values outside the range defined by the mean minus and plus the standard deviation. We characterize the prominent program behaviors by connecting their key characteristics to form an area shown in dark gray, which visualizes a benchmark's inherent behavior. By comparing the dark areas across the various benchmarks, we can see how different their behaviors are. We also cluster the benchmarks into 50 clusters, each bounded by a box, with the cluster representative indicated by an asterisk.
These plots provide valuable insight. For example, some benchmarks seem to be isolated. These benchmarks exhibit program characteristics very dissimilar to any of the other benchmarks, so they appear as singleton clusters. Examples are blast, mcf, and swim. We can derive the reason for their particular program behavior from the plots. For example, the reason for blast's being isolated is its large working set, and mcf exhibits long local store strides (that is, large address differences between consecutive memory accesses by the same static store instruction).

Another observation is that some benchmarks are susceptible to their input, whereas
Figure 1. CPI prediction error as a function of subset size for the PCA and GA data reduction approaches.
others are not. For example, the input to gcc and perl causes quite different behavioral characteristics; for vortex, gzip, and csu, on the other hand, the input doesn't seem to affect the behavioral characteristics as greatly. Yet another interesting observation is that many benchmarks from recently introduced benchmark suites exhibit dissimilar inherent behavior compared to SPECcpu2000 benchmarks. More particularly, about 40 percent of the clusters don't contain any of the SPECcpu benchmarks. This suggests that
Figure 2. Kiviat diagrams representing the eight key microarchitecture-independent characteristics of the 118 benchmarks.
Darkest areas represent inherent behavior patterns. Boxes represent clusters, and asterisks indicate cluster representatives.
SPECcpu doesn't adequately cover the entire workload space, and that a more complete benchmark suite should incorporate additional benchmarks for the microprocessor design cycle. The methodology presented here lets designers select a diverse and representative benchmark set.
Several potential avenues toward an even more effective workload characterization methodology are worth addressing. First, speeding up the time-consuming profiling step of collecting microarchitecture-independent characteristics would greatly improve our methodology's usability. Possible ways to accomplish this include sampling, hardware acceleration, and multithreaded profiling. Second, as we have demonstrated, a microarchitecture-independent workload characterization is more accurate and informative than a microarchitecture-dependent characterization. Ideally, however, the characterization should also be independent of the ISA and the compiler, to capture a program's true inherent behavior. Making the characterization both ISA and compiler independent is thus another interesting path to explore. Third, an important question is what program characteristics to include in the analysis. It is important that the characterization includes a sufficiently diverse set of characteristics. For example, extending the current set of characteristics to characterize multithreaded workloads in a microarchitecture-independent manner is an area to focus on. Also, as architectures evolve, it is important to revisit these program characteristics. Finally, other statistical, machine-learning, or data-mining techniques might be useful for analyzing workload behavior and extracting key program behavior characteristics. MICRO
Acknowledgments

We thank the anonymous reviewers for their valuable comments. Kenneth Hoste receives support through a doctoral student fellowship from the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders). Lieven Eeckhout receives support through a postdoctoral fellowship of the Fund for Scientific Research, Flanders, Belgium (FWO-Vlaanderen). We also thank Kjell Andresen from the University of Oslo for offering Alpha machine compute cycles.
Kenneth Hoste is a doctoral student in the Electronics and Information Systems Department of Ghent University, Belgium. His research interests include computer architecture in general and workload characterization in particular. Hoste has an MS in computer science from Ghent University.
Lieven Eeckhout is an assistant professor in the Electronics and Information Systems Department of Ghent University, Belgium. His research interests include computer architecture, virtual machines, performance analysis and modeling, and workload characterization. Eeckhout has a PhD in computer science and engineering from Ghent University. He is a member of the IEEE.
Direct questions and comments about this article to Kenneth Hoste and Lieven Eeckhout, ELIS, Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium; [email protected] and [email protected].