Towards Resiliency Evaluation of Vector Programs

Vishal Chandra Sharma, Ganesh Gopalakrishnan
School of Computing, University of Utah
Email: {vcsharma,ganesh}@cs.utah.edu

Sriram Krishnamoorthy
Pacific Northwest National Laboratory

Email: [email protected]

Abstract: The systems resilience research community has developed methods to manually insert additional source-program-level assertions to trap errors, and also devised tools to conduct fault injection studies for scalar program codes. In this work, we contribute the first vector oriented LLVM-level fault injector, VULFI, to help study the effects of faults in vector architectures that are of growing importance, especially for vectorizing loops. Using VULFI, we conduct a resiliency study of nine real-world vector benchmarks using Intel’s AVX and SSE extensions as the target vector instruction sets, and offer the first reported understanding of how faults affect vector instruction sets. We take this work further toward automating the insertion of resilience assertions during compilation. This is based on our observation that during intermediate (e.g., LLVM-level) code generation to handle full and partial vectorization, modern compilers exploit (and explicate in their code-documentation) critical invariants. These invariants are turned into error-checking code. We confirm the efficacy of these automatically inserted low-overhead error detectors for vectorized for loops.

I. INTRODUCTION

The growing importance of resilience to bit-flip induced errors demands concerted and prompt action by the research community. Recent studies have shown that soft error rates (SER) are expected to rise, and to become one of the major causes for reducing the mean-time-to-failure (MTTF) in exascale systems [1], [2]. While additional or hardened hardware circuitry can prevent the onset of failures or detect them early in the stack, with the ever increasing push for power-efficient architectures it is critically important to develop cross-layer resilience solutions that encompass all of the hardware and software stack. Given that resilience solutions typically incur a net drop in performance, it is equally important that the research community develop and make available a whole array of solutions that can then be evaluated over an adequate period of time by practitioners, so that the right combinations of solutions that minimize overheads may be chosen. It is also important for the research community to boost each other’s efforts by creating and sharing the necessary tooling support so that the solutions become available in a timely manner.

Among the central tools that most research groups bank on are bit-flip based fault injectors. These are tools that inexpensively and conveniently simulate bit-flips so that one may, using them, evaluate the efficacy of fault detection and correction mechanisms. There are many types of fault injectors: (1) direct beam studies, where chips and systems are subject to high-energy beams [3], (2) architectural level simulation platforms (e.g., SIMICS) where very detailed studies can be performed [4], (3) binary instruction level fault injectors (e.g., PIN-based injectors [5]), (4) LLVM-level fault injectors [6]–[8], and (5) other higher level fault injectors, including those that simulate faults at the source-code level [3].

While each fault injector type suits specific needs and while no one fault injector type subsumes another, there is increasing interest (as well as merit) in focusing on LLVM-based fault injectors: (1) these injectors are relatively easy to maintain, modify, and share; (2) LLVM [9], [10] offers increasing support for program analysis passes; (3) there is a significant growth in interest on the part of the HPC community to adopt LLVM-based solutions; and (quite importantly for this paper) (4) LLVM extensions for supporting vector instruction types, and special hooks to pass information to downstream tools, are on the ascent. Given all these points, our strong preference is to focus on LLVM-level fault injectors. In fact, we are one of the groups that has developed a fault injector called the Kontrollable Utah LLVM Fault Injector (KULFI, [6]). This injector has been put to use in several of our own projects [11], and by outside groups [12]. Calhoun et al. utilized KULFI to develop a new fault injector ‘FlipIt’ targeting MPI applications [13]. To the best of our knowledge, these fault injectors do not target vector instructions.

Encouraged by these successes, we set about to find out whether today’s resilience solutions are adequately developed to support exascale computing. We discovered many important areas requiring immediate attention:

(1) Even though vector architectures are central to gaining power efficiencies, none of the available fault injectors address these architectures. In order to support the research community in this area, we worked on not only making KULFI modular, but also equipping it with a rich array of features that support (or could be easily extended to support) multiple vector formats (including Intel’s SSE and AVX vector instruction sets). The resulting fault injector ‘VULFI’¹ (Vector oriented Utah LLVM Fault Injector) is our first main result.

¹ VULFI is available at: http://formalverification.cs.utah.edu/fmr/vulfi

(2) Vectorization is primarily supported in the form of: (i) language specific vector extensions which are supported in modern compilers such as GCC [14] and Clang [15], and (ii) dedicated programming languages with inbuilt support for data-level parallelism such as ISPC [16] and OpenCL [17] that can enable vectorization to occur more predictably and under programmer control. Thus, languages such as ISPC and OpenCL, and their associated compilers, must be part of the research focus; we found no prior work targeting this angle.

(3) There is the intriguing possibility that compiler writers of such languages have taken care to explain their code generator, and even provided some hints on the invariants being followed when specific situations are handled, for instance, partial vectorization supported by bit masking. We invested a significant amount of time studying ISPC’s publicly available code generator (well-supported in this effort by the compiler team who answered our questions promptly), and found that it offers another intriguing wrinkle in terms of fault detector synthesis: namely, that these invariants could be turned into inexpensive error detectors. This achieves two purposes at once: (i) One may actually be able to exploit specific patterns during code generation to tune and generate low overhead error detectors. While no single error detector type is sufficient to trap all types of faults (and our detectors are no exception), the attraction of error detectors that incur low overheads, are effective at trapping many faults, and (last but not least) can be automatically generated and inserted is potentially of huge interest in transferring resilience research into practice. (ii) Dialog between the resilience research community and the compiler community may be a two-way street in that, knowing the needs of the resilience community, compiler writers may be encouraged to document their compiler backends more and/or provide features that support the generation of even more low overhead resilience solutions.

Roadmap: In §II, we detail VULFI, capable of targeting instructions at LLVM’s intermediate representation (IR) level, and simulating soft errors occurring in a CPU’s vector units. Targeting vector instructions for fault injection requires the capability to distinguish between unmasked and masked vector instructions, including architecture specific LLVM intrinsics². This distinction is crucial in deciding whether or not to target a particular vector lane for a fault injection. Also, given that a vector register consists of a fixed number of packed scalar registers, a systematic approach is developed to allow each of these scalar registers to be treated independently during fault injection.

In §III, we demonstrate extracting IR-level loop invariants for a foreach loop supported in the ISPC compiler in order to synthesize error detectors. Our findings highlight that an understanding of the underlying code generation is central to discovering these invariants. We introduce our fault models, set up our definitions, and provide an overview of our case studies.

In §IV, we evaluate how well our detector types cover important situations in practice. We present a fault injection driven case study, done using VULFI, analyzing the resiliency of nine diverse C++ and ISPC vector benchmarks. We also employ the IR-level loop invariants to build soft error detectors, reporting their efficacy and overhead, using Intel’s open source ISPC [16], [18] as the language and the compiler of choice.

² Intrinsics referred to in this paper are listed in the Intrinsics.gen file distributed with the public release of LLVM 3.2.

II. VULFI: FAULT INJECTOR HANDLING VECTOR INSTRUCTIONS

A. Terminology and Assumptions

We now define our terminology as well as some of our default assumptions.

Vector and scalar instructions: An LLVM IR instruction will be referred to as an “instruction.” A vector instruction has at least one vector type operand.³ A scalar instruction has no vector operand.

Vector and scalar registers: A vector register is an Lvalue register or a source operand register of an instruction of vector type. A scalar register is an Lvalue register or source operand register of an instruction that has type integer, floating point, or pointer.

Vector length: The length of a vector register (Vl) is the number of scalar registers referred to within it.

getelementptr: At IR level, the address of an element of an aggregate data-structure, such as an array, is calculated using the getelementptr instruction.

Vector instruction – extractelement: It extracts a scalar element from a given location of a vector register.

Vector instruction – insertelement: It inserts a scalar element at a given location of a vector register.

Intrinsics: An intrinsic in this paper refers to a special function whose implementation is provided by the LLVM compiler infrastructure. All LLVM intrinsics start with the prefix @llvm.

Code generation, Architecture: We refer to IR-level code generated by the ISPC compiler with -O3 optimization targeting x86.

B. Fault model

We consider a single-bit fault introduced at a random bit position of a CPU’s vector or scalar registers holding either integer or floating point values during operations such as: (1) moving values between registers, (2) arithmetic and logical operations, and (3) load/store operations. To provide coverage for these scenarios, we always target the Lvalue of an instruction, with the exception of a store instruction, which does not have an Lvalue. This fault injection approach lets us simulate a variety of fault scenarios. For example, in the case of an arithmetic instruction, corrupting the Lvalue covers the scenarios where a bit-flip occurs either in one of the source operands of the instruction or in the arithmetic unit. For the case where a value is moved from a source register to a destination register, targeting the Lvalue covers both the scenarios where a bit-flip occurs either in the source register or in the destination register. Targeting the Lvalue of a load instruction covers the scenario where a bit-flip occurs in the load buffer. For a store instruction, we ensure that the value to be stored is considered for fault injection prior to the store operation.
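As a minimal sketch of the single-bit fault described above, the following C++ fragment flips one chosen bit in a 32-bit integer or in the bit pattern of a 32-bit float. The function names flipBitInt32 and flipBitFloat are hypothetical and are not part of VULFI; the sketch also assumes that float is a 32-bit IEEE-754 type.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption of this sketch: float is a 32-bit type.
static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Flip bit `bit` (0 = least significant) of a 32-bit integer value.
inline std::uint32_t flipBitInt32(std::uint32_t value, unsigned bit) {
    assert(bit < 32);
    return value ^ (std::uint32_t{1} << bit);
}

// Flip bit `bit` in the in-memory representation of a float.
// memcpy is used to reinterpret the bits without undefined behavior.
inline float flipBitFloat(float value, unsigned bit) {
    std::uint32_t raw;
    std::memcpy(&raw, &value, sizeof raw);
    raw = flipBitInt32(raw, bit);
    std::memcpy(&value, &raw, sizeof value);
    return value;
}
```

For instance, flipping bit 31 (the sign bit) of 1.0f yields -1.0f, while flipping a low mantissa bit perturbs the value only slightly; which bit is hit therefore strongly influences whether a fault is benign.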

³ The data types referred to in this paper correspond to the type definitions provided in http://llvm.org/docs/LangRef.html

[Figure 1 diagram labels: Target Program; Compiler Frontend (ISPC, Clang); Program Bitcode; LLVM; Site Selector; VULFI Instrumentor; Instrumented Bitcode; VULFI Driver; Fault Injection Report; LLVM or ISPC Module; VULFI Module]

Fig. 1: VULFI Design

[Figure 2 diagram labels: Pure-data Sites; Address Sites; Control Sites]

Fig. 2: Relationship between different fault site categories

We inject exactly one fault during the whole execution span of a program. More specifically, for a given program executed under a given input, and having N dynamic fault sites, exactly one fault site is selected at random with a uniform probability of 1/N for the fault injection. Here, a dynamic fault site refers to a fault site associated with a runtime instance of a static instruction. Similar fault models have been used in recent fault injection studies on scalar architectures [6], [19], [20]. VULFI uses this fault model for performing fault injections. It first builds a list of fault sites using the Lvalues of the target instructions. If an Lvalue is a vector register, then each of its scalar elements is considered a unique fault site. The target instructions are selected based on one of the fault site selection heuristics (§II-C). Each fault site from the target list is then instrumented with a call instruction which invokes one of VULFI’s runtime fault injection API functions.
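The selection of exactly one of N dynamic fault sites with probability 1/N can be sketched as follows. This is an illustration under stated assumptions, not VULFI’s actual runtime: the names pickDynamicSite and InjectionTrigger are hypothetical, and the sketch assumes that N has already been obtained from a fault-free run.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Choose one dynamic fault site index in [1, numSites], each with
// probability 1/numSites.
inline std::uint64_t pickDynamicSite(std::uint64_t numSites, std::mt19937_64& rng) {
    assert(numSites > 0);
    std::uniform_int_distribution<std::uint64_t> dist(1, numSites);
    return dist(rng);
}

// Runtime counter: every instrumented fault site calls shouldInject() once
// per dynamic execution; it returns true only for the chosen instance, so
// exactly one fault is injected during the whole run.
class InjectionTrigger {
public:
    explicit InjectionTrigger(std::uint64_t target) : target_(target) {}
    bool shouldInject() { return ++seen_ == target_; }

private:
    std::uint64_t target_;
    std::uint64_t seen_ = 0;
};
```

Because the counter advances on every dynamic instance, a static instruction inside a hot loop contributes many dynamic sites and is proportionally more likely to be hit, which matches the uniform 1/N model over dynamic (not static) sites.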

C. Fault site selection

VULFI uses one of the following fault site selection heuristics to build an initial list of fault sites to be targeted for fault injections. Specifically, VULFI analyzes the forward slice of a fault site to classify it into one of the following categories:

1) Pure-data sites: The forward slice of the fault site must not have any getelementptr (address calculation) or control-flow instructions.

2) Control sites: The forward slice must have at least one control-flow instruction.

3) Address sites: The forward slice must have at least one getelementptr.

    void foo(int a[], int n, int x){
        int s = x;
        for(int i=0;i<n;i++){
            a[i] = a[i] * s;
            s = s + i;
        }
    }

Fig. 3: An example C++ function foo()

For example, in Figure 3, a bit-flip occurring in the variable i may cause the loop execution to either end prematurely or run for more than n iterations. It may also become an out-of-bound index for the array a[], thereby potentially causing an invalid memory reference. However, a bit-flip occurring in the variable s will never affect the loop control, nor will it cause an invalid memory reference. Therefore, the variable i is an example of both a control site and an address site, whereas the variable s is an example of a pure-data site. Figure 2 more formally shows how these three fault site categories relate.
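The three categories of §II-C can be sketched as a classification over the instruction kinds found in a fault site’s forward slice. This is a simplified illustration: the Op enumeration and the classifySite function are hypothetical, and the forward slice is assumed to be given as a flat list rather than computed from LLVM IR.

```cpp
#include <vector>

// Simplified instruction kinds that may appear in a forward slice.
enum class Op { Arithmetic, Compare, Branch, GetElementPtr, Load, Store };

struct SiteCategory {
    bool pureData;  // no address calculation and no control flow in the slice
    bool control;   // at least one control-flow instruction in the slice
    bool address;   // at least one getelementptr in the slice
};

// Classify a fault site from the instruction kinds in its forward slice.
// A site can be both a control site and an address site, but a pure-data
// site is neither.
inline SiteCategory classifySite(const std::vector<Op>& forwardSlice) {
    bool hasControl = false;
    bool hasAddress = false;
    for (Op op : forwardSlice) {
        if (op == Op::Branch) hasControl = true;
        if (op == Op::GetElementPtr) hasAddress = true;
    }
    return SiteCategory{!hasControl && !hasAddress, hasControl, hasAddress};
}
```

Under this sketch, the variable i of foo() would have a slice containing a compare, a branch, and an address calculation (control and address site), whereas the slice of s would contain only arithmetic and the stored value (pure-data site).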

D. Instrumentation workflow

Figure 4 shows the instrumentation workflow of VULFI using an example of a vector register of length four, with each of its elements considered a unique fault site. VULFI performs the following three key operations as part of the instrumentation process: (1) It iterates over each of the scalar elements in the cloned value of the vector register; (2) In each step, VULFI extracts an uninstrumented scalar element (represented by a white circle), performs instrumentation, and inserts the result (represented as a solid black circle) into the vector register; (3) Finally, VULFI replaces the original vector register with its new cloned and instrumented version, redirecting all the users of the original vector register.

Figure 5(A) illustrates this on a masked vector load operation followed by a masked vector store operation. The vector load and store operations are done using the x86 intrinsics @llvm.x86.avx.maskload.ps.256 and @llvm.x86.avx.maskstore.ps.256, respectively. VULFI maintains an inbuilt list of x86 intrinsics, which classifies whether any given intrinsic performs a masked vector operation. In the current example, this information is used to ascertain that both intrinsics use the execution mask value of the %floatmask.i register to enable or disable load or store operations along the vector lanes. Figure 5(B) shows the instrumented version of the vector load and store operations. Each scalar element of the vector register %0 (the chosen fault site for instrumentation) is extracted using the extractelement instruction (locations L1 and L5). The respective execution mask values of the scalar elements are extracted from the vector register %floatmask.i (locations L2 and L6). An extracted element and its execution mask value are then passed on to the runtime fault injection API (injectFaultFloatTy() at location L7 in the current example) to perform the actual fault injection at runtime.
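The extract, inject, and insert sequence applied to each lane can be mimicked in plain C++ as below. This is only a sketch of the idea: the execution mask is modeled as one bool per lane (the IR in Figure 5 carries it as a float vector), and injectFaultFloat, instrumentVector, the target lane, and the bit position are hypothetical stand-ins for decisions that the real runtime API makes internally.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLanes = 8;
using Vec8f = std::array<float, kLanes>;
using Mask8 = std::array<bool, kLanes>;

// Stand-in for the runtime injection call: corrupt the lane value only if
// the lane is active under the execution mask and is the selected site.
inline float injectFaultFloat(float lane, bool laneActive, bool selected, unsigned bit) {
    if (!laneActive || !selected) return lane;
    std::uint32_t raw;
    std::memcpy(&raw, &lane, sizeof raw);
    raw ^= std::uint32_t{1} << (bit % 32);
    std::memcpy(&lane, &raw, sizeof lane);
    return lane;
}

// Per-lane instrumentation of a cloned vector value.
inline Vec8f instrumentVector(const Vec8f& original, const Mask8& mask,
                              std::size_t targetLane, unsigned bit) {
    Vec8f cloned = original;                       // clone the vector value
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        float ext = cloned[lane];                  // extractelement
        float inj = injectFaultFloat(ext, mask[lane], lane == targetLane, bit);
        cloned[lane] = inj;                        // insertelement
    }
    return cloned;                                 // replaces the original for all users
}
```

Passing the mask alongside each lane is what keeps inactive lanes from being treated as fault sites: a flip in a lane that the masked store never writes back would otherwise inflate the count of benign outcomes.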

III. ERROR DETECTORS USING LLVM IR-LEVEL INVARIANTS

We now describe two specific instances of how compilation methods can be exploited to generate error-checking invariants based on the code-generation logic of compilers.

A. Example 1: Loop Invariants in a foreach Loop Construct

An ISPC foreach loop accepts one or more dimension variables⁴ of integer types, with the iteration space of each dimension variable bounded by an interval [start, end]. A foreach loop uses its dimension variables as iterators to iterate over the loop body. To maximize lane utilization, for a given dimension variable, ISPC uniformly distributes the first {n − (n%Vl)} loop iterations across Vl vector lanes, where n = end − start. The remaining n%Vl loop iterations are handled separately.

⁴ In this paper, we have considered foreach loops with only one dimension variable, but our findings are applicable to the cases with more than one dimension variable.

Consider Figure 7, which presents the control-flow graph (CFG) of the vector copy function vcopy_ispc of Figure 6. The uniform qualifier appearing in vcopy_ispc denotes that all vector lanes share the same address of arrays a1 and a2, as well as the variable n. The foreach_full_body basic block handles the first {n − (n%Vl)} loop iterations, Vl at a time, with all Vl vector lanes performing parallel copy operations. The remaining n%Vl loop iterations are done in the basic block partial_inner_all_only. The values {n − (n%Vl)} and n%Vl are represented by the definitions aligned_end and nextras respectively in the entry basic block allocas. The definition new_counter is the loop iterator for the foreach_full_body basic block. Based on these facts, one can construct the loop invariants for foreach constructs shown in Figure 8. Clearly, such invariants must always hold within, as well as upon exit from, a foreach loop appearing in an ISPC program (to minimize overheads, we check them only upon exit).
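The iteration split and the three invariants of Figure 8 can be sketched in C++ as follows. The names nextras, aligned_end, and new_counter follow the IR definitions named above; the helper splitIterations is hypothetical, and the checker is written here as returning a bool for ease of illustration, whereas the detector call in Figure 7 is a void runtime function whose reporting mechanism is not described in this section.

```cpp
#include <cstdint>

struct ForeachSplit {
    std::int32_t nextras;     // n % Vl, iterations left for the partial block
    std::int32_t alignedEnd;  // n - (n % Vl), iterations run at full width
};

// Mirror of the entry-block computation (srem followed by sub).
inline ForeachSplit splitIterations(std::int32_t n, std::int32_t vl) {
    std::int32_t nextras = n % vl;
    return ForeachSplit{nextras, n - nextras};
}

// Invariants 1-3 of Figure 8, evaluated on exit from foreach_full_body.
inline bool checkInvariantsForeachFullBody(std::int32_t newCounter,
                                           std::int32_t alignedEnd,
                                           std::int32_t vl) {
    return newCounter >= 0            // Invariant 1
        && newCounter <= alignedEnd   // Invariant 2
        && newCounter % vl == 0;      // Invariant 3
}
```

With n = 19 and Vl = 8, aligned_end is 16 and nextras is 3. A corrupted counter of 17 or 24 violates an invariant and is flagged, whereas a corrupted counter of 0 still satisfies all three, which illustrates why such detectors cannot trap every fault.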

B. Example 2: Protecting uniform variables

In ISPC, a uniform variable is shared across all vector lanes. The compiler achieves this by storing a uniform value first into a scalar register and then broadcasting it to a vector register. In Figure 9 (a typical result of compiling a code block containing a uniform variable), uval is a scalar register storing a uniform value. This value is first copied to the first element of the vector register uval_broadcast_init, and subsequently broadcast to all other locations using the shufflevector instruction. The resultant value is stored in the vector register uval_broadcast. A bit flip affecting any of the scalar elements of uval_broadcast can be detected by inserting a piece of checker code which ensures that all scalar elements hold the same value before every read from uval_broadcast (inexpensively achieved by XORing the elements). Such detectors can provide good, though not perfect, fault detection coverage at very low cost.
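A minimal sketch of such an XOR-based consistency check is shown below, assuming an 8-lane float vector and 32-bit floats. The function name broadcastIsConsistent is hypothetical; this detector is described in the paper as future work, so the code only illustrates the idea rather than an existing implementation.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Returns true if all lanes of a broadcast vector hold the same bit pattern.
// Each lane is XORed with lane 0; any nonzero result means some lane differs.
inline bool broadcastIsConsistent(const std::array<float, 8>& v) {
    std::uint32_t first;
    std::memcpy(&first, &v[0], sizeof first);
    std::uint32_t diff = 0;
    for (float lane : v) {
        std::uint32_t raw;
        std::memcpy(&raw, &lane, sizeof raw);
        diff |= raw ^ first;  // accumulate any differing bits
    }
    return diff == 0;
}
```

The check compares bit patterns rather than floating point values, so it also flags flips that a numeric comparison could miss (for example, a flip between +0.0 and -0.0). It cannot detect a fault in the scalar uval before the broadcast, since all lanes would then agree on the corrupted value.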

We have implemented a prototype LLVM transformation pass implementing the detector described in §III-A (implementing the detector described in §III-B will be part of our future work). Our prototype implementation automatically inserts a detector basic block for each occurrence of a foreach loop in a program (Figure 7 highlights this block, namely foreach_fullbody_check_invariants). This block contains a call instruction which calls our runtime detector API that takes new_counter, aligned_end, and Vl as arguments. As noted earlier, we invoke the detector block only upon loop exit. Our preliminary results (discussed in §IV) are quite encouraging.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

The experiments described in the paper were carried out on an Intel Core™ i7 4770 system running the 64-bit Ubuntu 12.04 operating system with 16 GB of main memory. VULFI was developed with LLVM version 3.2, and the ISPC benchmark programs were compiled using ISPC compiler version 1.8.1.

B. Execution strategy

[Figure 4 diagram labels: A vector of length 4; Clone value; Instrument the scalar element at index 0; Instrument the scalar element at index 1; ...; User 1 ... User N]

Fig. 4: VULFI instrumentation workflow

A fault injection experiment involves executing a benchmark program twice using a randomly selected program input chosen from a predefined set of inputs. During the first execution, no faults are injected, the execution output is recorded, and a dynamic fault site is chosen at random from a list of dynamic fault sites. The dynamic fault sites list is built by selecting either pure-data sites, control sites, or address sites. The second execution involves actual fault injection into the dynamic fault site chosen during the earlier run, using the fault model described in §II-B. VULFI classifies the result of a fault injection experiment into one of the following categories by comparing the execution outputs from the two program executions:

1) Silent Data Corruption (SDC): When the result of the faulty execution differs from that of the fault-free execution.

2) Benign: When no difference is observed between the executions.

3) Crash: When the faulty execution results in a system failure, a program crash, or any other issue that could easily be detected by the end user.
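The three outcome categories above can be expressed as a small comparison routine. This is a sketch under simplifying assumptions: the RunResult record and classifyOutcome function are hypothetical, program output is reduced to a single string, and every user-visible failure is folded into one crashed flag.

```cpp
#include <string>

enum class Outcome { SDC, Benign, Crash };

// Summary of one program execution (hypothetical record for this sketch).
struct RunResult {
    bool crashed;        // system failure, program crash, or similar visible issue
    std::string output;  // recorded execution output
};

// Compare the faulty run against the fault-free ("golden") run.
inline Outcome classifyOutcome(const RunResult& golden, const RunResult& faulty) {
    if (faulty.crashed) return Outcome::Crash;
    return faulty.output == golden.output ? Outcome::Benign : Outcome::SDC;
}
```

The crash test comes first because a crashed run usually also produces different (or no) output; checking the output first would misreport such runs as SDCs.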

C. Benchmarks

Table I lists the 9 benchmarks that we use for our fault injection experiments using VULFI. Benchmarks fluidanimate and swaptions are drawn from the PARVEC benchmark suite [21], and are vectorized implementations of their respective serial versions available with the PARSEC benchmark suite [22], [23]. Benchmarks Blackscholes, Sorting, Stencil, and Ray tracing are selected from the list of benchmarks available with the public release of the ISPC compiler. The remaining three benchmarks are our own ISPC implementations of the respective C++ versions made available as part of the scientific computing library (SCL) by Burkardt [24]. For our fault injection studies, we target the fault sites corresponding to each vectorized function of these benchmark programs. The fault sites are selected using one of the heuristics explained in §II-C. The benchmarks are evaluated using AVX and SSE4 as target x86 vector instruction sets.

Benchmark Characteristics: Figure 10 shows the mix of scalar and vector instructions for all 9 benchmarks. A significant portion of instructions under the pure-data and control fault site categories are vector instructions. Specifically, the share of vector instructions, averaged across all 9 benchmarks, stands at 67% and 43% for the pure-data and control fault site categories, respectively. The seemingly low percentage of vector instructions under the address category should be taken with a grain of salt because, at the IR level, a scalar address is frequently cast into a vector address as and when required by a vector instruction. These details clearly highlight the importance of having a fault injector that can target vector instructions, something we achieved by creating VULFI.

D. Fault Injection Study

Table I summarizes the average number of dynamic instructions observed for each of our benchmarks. In fact, the average dynamic instruction count easily runs into millions for most of the benchmarks. Given that fault-injection based studies require huge numbers of runs on top of such high dynamic instruction counts, a key goal is to minimize the overall computational effort by employing statistical measures, while still maintaining a high degree of confidence in the reported results. To this end, our fault injection study is done by performing statistically significant fault injection campaigns for each of the benchmark programs.

    (A)
    L0: %0 = tail call <8 x float> @llvm.x86.avx.maskload.ps.256(
                 i8* %array_ld_addr, <8 x float> %floatmask.i)
    L1: call void @llvm.x86.avx.maskstore.ps.256(i8* %array_str_addr,
                 <8 x float> %floatmask.i, <8 x float> %0)

    (B)
    L0: %0 = tail call <8 x float> @llvm.x86.avx.maskload.ps.256(
                 i8* %array_ld_addr, <8 x float> %floatmask.i)
    L1: %ext0 = extractelement <8 x float> %0, i32 0
    L2: %extmask0 = extractelement <8 x float> %floatmask.i, i32 0
    L3: %inj0 = call float @injectFaultFloatTy(float %ext0,
                 float %extmask0)
    L4: %ins0 = insertelement <8 x float> %0, float %inj0, i32 0
    ...
    ...
    L5: %ext7 = extractelement <8 x float> %ins96, i32 7
    L6: %extmask7 = extractelement <8 x float> %floatmask.i, i32 7
    L7: %inj7 = call float @injectFaultFloatTy(float %ext7,
                 float %extmask7)
    L8: %ins7 = insertelement <8 x float> %ins96, float %inj7, i32 7
    L9: call void @llvm.x86.avx.maskstore.ps.256(i8* %array_str_addr,
                 <8 x float> %floatmask.i, <8 x float> %ins7)

Fig. 5: Uninstrumented (A) and instrumented (B) masked vector load and store instructions

A fault injection campaign comprises 100 independent fault injection experiments. The SDC rate calculated for a fault injection campaign is considered a unique random sample. We run a sufficient number of fault injection campaigns until: (1) the sample distribution becomes normal or near normal, and (2) for a target confidence level of 95%, the margin of error for the distribution falls within the range of ±3%. We observe that for each of our benchmarks, running 20 fault injection campaigns for each of the fault site categories (pure-data, control, and address) is sufficient to achieve a 95% confidence level with a margin of error within ±3%. More specifically, each row entry in the table refers to 60 fault injection campaigns, 20 each for the pure-data, control, and address fault site categories (for both AVX and SSE4). This pushes the total number of fault injection experiments to 9 × 2 × 3 × 2000 = 108,000. The margin of error is calculated by applying the standard t-value based formula where the sample size and the standard error of the sample distribution are known [25].
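The exact formula is not spelled out in the text; assuming it is the usual t-based margin of error for a sample mean, with k = 20 campaigns per category and s the sample standard deviation of the per-campaign SDC rates, the stopping criterion and the experiment count can be written as:

```latex
% Margin of error at 95% confidence for k campaign-level SDC rates
% (assumed form of the "standard t-value based formula"):
E \;=\; t_{0.975,\,k-1}\,\frac{s}{\sqrt{k}}, \qquad k = 20,\quad t_{0.975,\,19} \approx 2.093

% The +/-3% criterion therefore bounds the spread of the campaign SDC rates:
E \le 3\% \;\Longleftrightarrow\; s \le \frac{3\%\cdot\sqrt{20}}{2.093} \approx 6.4\%

% Total number of experiments:
% 9 benchmarks x 2 instruction sets x 3 site categories x (20 campaigns x 100 experiments)
9 \times 2 \times 3 \times (20 \times 100) \;=\; 108{,}000
```

Under this reading, 20 campaigns suffice whenever the campaign-to-campaign standard deviation of the SDC rate stays below roughly 6.4 percentage points.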

Figure 11 shows that among all benchmarks, Stencil and Blackscholes witness the highest rates of SDC, whereas Swaptions and Conjugate Gradient have the lowest SDC rates across all three fault site categories. Along expected lines (as these examples are array-intensive), the address fault site category results in the largest number of program crashes. However, for benchmarks Sorting, Stencil, and Chebyshev, the address category also generates a significant number of SDCs. In fact, in the case of the Chebyshev benchmark, the SDC rate under the address category is the highest among all three fault site categories. The result clearly establishes the Swaptions and Conjugate Gradient benchmarks as the most resilient programs, witnessing the least number of SDCs and crashes.

E. Error Detection Study

We evaluate the efficacy of the error detectors based on the foreach loop invariants described in §III-A on the micro-benchmarks vector copy, vector dot product, and vector sum. Due to the (relatively) smaller size of these benchmarks, we follow a more comprehensive evaluation strategy by carrying out 2000 fault injection experiments for each of the micro-benchmarks under each of the fault site categories pure-data, control, and address. The detector’s effectiveness is measured in terms of the percentage of fault injection experiments that end up in SDCs, together with the number that get flagged by our detectors.

The loop invariants in Figure 8 depend on the IR-level loop iterator new_counter. The value of this loop iterator is used to evaluate the loop exit condition, and also to calculate the addresses of the array elements referenced in the micro-benchmarks. Therefore, fault sites affecting the loop iterator will be categorized as either a control site or an address site or both, but can never be a pure-data site, adhering to the relation shown in Figure 2.

Figure 12 confirms our hypothesis, showing that none of the SDCs are flagged by our detectors when pure-data sites are targeted for fault injection. In contrast, faults affecting control sites lead to SDC rates as high as 96.5% (for vector sum). In addition, 48.7% of the total fault injection experiments that end up in SDCs are also successfully detected.

Overall, the highest SDC detection rate is witnessed under the control site category, with detectors approaching a detection rate of 57% for both the vector copy and dot product micro-benchmarks. Faults affecting address sites report a relatively low SDC rate. This is because a substantial number of fault injection experiments end up in program crashes.

    void vcopy_ispc(uniform int a1[], uniform int a2[], uniform int n){
        foreach(i = 0 ... n){
            a2[i]=a1[i];
        }
        return;
    }

Fig. 6: ISPC implementation of vector copy

[Figure 7 diagram: basic blocks of the CFG and the instructions shown in them are listed below; the branch edges (labeled T/F in the figure) are omitted]

    allocas:
        ...
        %nextras = srem i32 %n, 8
        %aligned_end = sub i32 %n, %nextras
        ...
    foreach_full_body.lr.ph:
        ...
    foreach_full_body:
        %counter = phi i32 [ 0, %foreach_full_body.lr.ph ], [ %new_counter, %foreach_full_body ]
        ...
        %new_counter = add i32 %counter, 8
        ...
    foreach_fullbody_check_invariants (Detector Block):
        call void @checkInvariantsForeachFullBody(i32 %new_counter, i32 %aligned_end, i32 8)
        br label %partial_inner_all_outer
    partial_inner_all_outer:
        ...
    partial_inner_only:
        ...
    foreach_reset:
        ...

Fig. 7: Control flow graph of the vcopy_ispc() function with a detector block inserted

Invariant 1: new_counter ≥ 0
Invariant 2: new_counter ≤ aligned_end
Invariant 3: (new_counter % Vl) == 0

Fig. 8: Loop invariants for foreach_full_body basic block

    %uval_broadcast_init = insertelement <8 x float> undef, float %uval, i32 0
    %uval_broadcast = shufflevector <8 x float> %uval_broadcast_init,
                          <8 x float> undef, <8 x i32> zeroinitializer

Fig. 9: Broadcasting the value of the uniform variable uval to a vector register

[Figure 10: stacked bar chart, y-axis 0% to 100%. For each benchmark (Fluidanimate, Swaptions, Blackscholes, Sorting, Stencil, Chebyshev, Jacobi, CG, Raytracing) and each fault site category (Pure-data, Ctrl, Addr), one bar each for AVX and SSE. Legend: Scalar Instructions, Vector Instructions.]

Fig. 10: Composition of vector and scalar instructions in the benchmark programs

[Figure 11: stacked bar chart, y-axis 0% to 100%. For each benchmark (Fluidanimate, Swaptions, Blackscholes, Sorting, Stencil, Chebyshev, Jacobi, CG, Raytracing) and each fault site category (Pure-data, Ctrl, Addr), one bar each for AVX and SSE. Legend: SDC, Benign, Crash.]

Fig. 11: Result of fault injection experiments for the vector benchmark programs

[Figure 12: bar chart, y-axis 0% to 100%. Legend: Avg. Overhead, SDC, SDC Detection Rate. Data labels shown in the chart:]

    Micro-benchmark   Fault sites   SDC        SDC Detection Rate   Avg. Overhead
    vector copy       Pure-data     99.95%     –                    8.60%
                      Ctrl          75.30%     57.10%
                      Addr          39.45%     8.75%
    dot product       Pure-data     97.80%     –                    8.09%
                      Ctrl          95.25%     57.65%
                      Addr          41.95%     8.00%
    vector sum        Pure-data     100.00%    –                    8.39%
                      Ctrl          96.50%     48.70%
                      Addr          43.25%     5.50%

Fig. 12: SDC detection rate on micro-benchmarks using the invariants based detectors

    Suite    Benchmark           Language  Test Input                                               Target  Average Dynamic Instruction Count (in millions)
    Parvec   Fluidanimate        C++       sim small, sim medium                                    AVX     170.8
                                                                                                    SSE     199.7
             Swaptions           C++       Swaptions: [16,64]; Simulations: [100,200]               AVX     19.7
                                                                                                    SSE     16.0
    ISPC     Blackscholes        ISPC      sim small, sim medium, sim large                         AVX     2.0
                                                                                                    SSE     1.9
             Sorting             ISPC      1D Array length: [1000,100000]                           AVX     5.9
                                                                                                    SSE     5.4
             Stencil             ISPC      2D Array dimension: Min. 16x16, Max. 64x64               AVX     57.4
                                                                                                    SSE     69.3
             Ray tracing         ISPC      Camera input: Sponza, Teapot, Cornell                    AVX     69.6
                                                                                                    SSE     68.8
    SCL      Chebyshev           ISPC      Degree: [1,256]                                          AVX     1.8
                                                                                                    SSE     0.8
             Jacobi              ISPC      2D Array dimension: Min. 32x32, Max. 192x192             AVX     52.0
                                                                                                    SSE     44.5
             Conjugate Gradient  ISPC      2D Array dimension: Min. 32x32, Max. 256x256             AVX     45.6
                                                                                                    SSE     43.6

TABLE I: List of Benchmarks used in the Fault Injection Study

The overhead incurred by our detectors is measured by executing an instrumented program binary with and without the detector block inserted and comparing the runtimes. Average overhead is calculated by averaging the overhead data from 2,000 individual runs for each micro-benchmark. A low average overhead of approximately 8% is observed across all three micro-benchmarks. We believe that with increasing operation counts inside the foreach loop body (compared to our very short loop bodies), the overhead introduced by these detector blocks will be further amortized.
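The overhead calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the overhead is computed per pair of runs (without and with the detector block) and then averaged, and the function names and runtimes are hypothetical, not measurements from the study.

```python
from statistics import mean

def percent_overhead(baseline_s: float, with_detector_s: float) -> float:
    """Relative slowdown caused by the detector block, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return 100.0 * (with_detector_s - baseline_s) / baseline_s

def average_overhead(pairs) -> float:
    """Average the per-run overhead over (baseline, with_detector) runtime pairs."""
    return mean(percent_overhead(base, det) for base, det in pairs)

# Hypothetical runtimes in seconds; the study averages 2,000 runs per micro-benchmark.
runs = [(1.00, 1.08), (2.00, 2.18), (0.50, 0.535)]
print(f"{average_overhead(runs):.2f}%")  # prints 8.00%
```

Averaging per-run ratios rather than dividing total runtimes gives every run equal weight regardless of its absolute duration; either convention could be used as long as it is applied consistently.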

The examples presented in §III and the preliminary results discussed here have been quite encouraging, and we believe that we have barely scratched the surface of the possibility space for exploiting compilation-aware detectors. This line of work, especially targeting vector instruction sets and SPMD languages, will be the main focus of our future work.
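As a rough illustration of what a compilation-aware detector might check, the following Python sketch models a vectorized sum loop with a detector block at the head of each full-vector iteration. The specific invariants (the loop counter is a multiple of the vector width and stays below the aligned end), the vector width of 8, and all names are illustrative assumptions; they are not claimed to be the exact checks generated for the micro-benchmarks.

```python
VL = 8  # assumed vector width, e.g. eight 32-bit lanes in an AVX register

class InvariantViolation(Exception):
    """Raised by the detector block when an assumed loop invariant fails."""

def check_invariants(counter: int, aligned_end: int) -> None:
    # Assumed invariants of the full-vector loop body: the counter advances
    # in steps of VL and never reaches aligned_end inside the loop.
    if counter % VL != 0 or not (0 <= counter < aligned_end):
        raise InvariantViolation(f"counter={counter}, aligned_end={aligned_end}")

def vector_sum(data, corrupt_at=None, flip_bit=0):
    n = len(data)
    aligned_end = n - (n % VL)       # largest multiple of VL not exceeding n
    total, counter = 0, 0
    while counter < aligned_end:     # full-vector iterations
        if counter == corrupt_at:    # simulated soft error in the loop counter
            counter ^= 1 << flip_bit
            corrupt_at = None        # inject only once
        check_invariants(counter, aligned_end)  # detector block
        total += sum(data[counter:counter + VL])
        counter += VL
    return total + sum(data[aligned_end:])      # scalar remainder

data = list(range(32))
print(vector_sum(data))                            # fault-free result: 496
print(vector_sum(data, corrupt_at=8, flip_bit=3))  # undetected: 524
try:
    vector_sum(data, corrupt_at=8, flip_bit=1)     # counter becomes 10
except InvariantViolation as err:
    print("detected:", err)
```

Flipping bit 1 breaks the divisibility invariant and is caught, whereas flipping bit 3 turns the counter into another valid multiple of VL and silently corrupts the sum. This is one way a detector of this kind can fall short of a 100% detection rate.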

V. OTHER RELATED WORK

We have discussed pertinent related work throughout the paper. Here we present additional related research efforts.

A recent fault injection study by Hari et al. introduces SASSIFI, an assembly-level fault injector built specifically for NVIDIA’s CUDA architecture [26]. Another CUDA-centric fault injection study by Fang et al. introduces the fault injector GPU-Qin [27], which uses the CUDA debugger tool to insert breakpoints at the program locations where faults are to be injected. In contrast to these fault injectors, VULFI targets vector instructions at the LLVM level with awareness of architecture-specific vector extensions such as Intel’s AVX and SSE instruction sets. We believe that targeting LLVM-level vector instructions enables VULFI to support any SPMD front-end compiler that uses LLVM as the back end for machine code generation.
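To make the notion of a vector-level fault concrete, the sketch below models a single bit flip in one lane of a vector register and classifies the outcome as SDC, Benign, or Crash. It is a Python model only: VULFI itself operates on LLVM instructions, and the helper names, the example kernels, and the use of a Python exception to stand in for a crash are all assumptions made for illustration.

```python
import struct

def flip_bit_f32(value: float, bit: int) -> float:
    """Flip one bit (0..31) in the IEEE-754 single-precision encoding of value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

def with_lane_fault(vector, lane, bit):
    """Copy a vector register and flip one bit in one lane (float or int lanes)."""
    faulty = list(vector)
    if isinstance(faulty[lane], float):
        faulty[lane] = flip_bit_f32(faulty[lane], bit)
    else:
        faulty[lane] ^= 1 << bit
    return faulty

def classify(golden, run):
    """Compare a faulty run against the fault-free (golden) output."""
    try:
        observed = run()
    except Exception:  # stands in for a segmentation fault or abort
        return "Crash"
    return "Benign" if observed == golden else "SDC"

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gather_sum(table, indices):
    return sum(table[i] for i in indices)

a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 1.0, 0.0, 1.0]
idx = [0, 1, 2, 3]

# Pure-data site, lane 0: the sign flip changes the dot product.
print(classify(dot(a, b), lambda: dot(with_lane_fault(a, 0, 31), b)))  # SDC
# Pure-data site, lane 2: the lane is multiplied by 0.0, so the fault is masked.
print(classify(dot(a, b), lambda: dot(with_lane_fault(a, 2, 31), b)))  # Benign
# Address site: index 1 becomes 17, which is out of bounds.
print(classify(gather_sum(a, idx),
               lambda: gather_sum(a, with_lane_fault(idx, 1, 4))))     # Crash
```

The three calls mirror the distinction between pure-data and address fault sites: a flip in a data lane tends to corrupt or be masked by later arithmetic, while a flip in an index lane can redirect a memory access outside the valid range.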

VI. CONCLUSION & FUTURE WORK

Despite the flurry of research underway in system resilience, very few solutions (including our past contributions) have been adopted and put into practice. The “sticker shock” of suffering a flat-out ~25% to 2× overhead (typical figures for various error detectors and dual modular redundancy) can be unpalatable; a practitioner might prefer going back to an older lithography, suffering fewer errors (and less overhead). The burden of manually inserting detectors into the source code can also hinder adoption. Finally, the non-availability of detector types (and even of the means for conducting studies) pertaining to vector instruction sets and SPMD languages further hinders adoption.

Our primary contribution in this paper is a well-engineered fault injector (VULFI) catering to systematic fault injection studies for vector instruction architectures. While we cannot yet say that our overheads are within the ballpark of unquestioned acceptance, the detectors prove to be surprisingly lightweight, can be automatically generated and inserted during compilation, and may, in the grand scheme of things, provide the right kind and level of solution. Given that this is the first (as far as we know) vector-oriented fault injection framework and the first attempt to exploit the logic of compilation to assist with system resilience, and given that we now see a huge array of possibilities to expand our initial studies in future work, we believe that this direction has a very good chance of yielding artifacts that will transition into practice.

VII. ACKNOWLEDGEMENT

This work was supported in part by the U.S. Department of Energy’s (DOE) Office of Science, Office of Advanced Scientific Computing Research, under award 66905. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. The Utah authors were supported under the same DOE project with award number 55800790, and also by NSF Award CCF 1255776. We would also like to thank the developers of Intel’s ISPC compiler, especially Matt Pharr and Dmitry Babokin, for their great support in fixing a bug and promptly answering our queries.

REFERENCES

[1] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, “Toward Exascale Resilience,” International Journal of High Performance Computing Applications, vol. 23, no. 4, pp. 374–388, 2009.

[2] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir, “Toward Exascale Resilience: 2014 update,” Supercomputing frontiers and innovations, vol. 1, no. 1, pp. 5–28, 2014.

[3] C. Cher, M. S. Gupta, P. Bose, and K. P. Muller, “Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection,” in SC, 2014, pp. 587–596.

[4] M.-L. Li, P. Ramachandran, U. R. Karpuzcu, S. K. S. Hari, and S. V. Adve, “Accurate Microarchitecture-level Fault Modeling for Studying Hardware Faults,” HPCA, pp. 105–116, Feb. 2009.

[5] A. Jin, J. Jiang, J. Hu, and J. Lou, “A PIN-Based Dynamic Software Fault Injection System,” in International Conference for Young Computer Scientists, 2008, pp. 2160–2167.

[6] V. C. Sharma, A. Haran, Z. Rakamaric, and G. Gopalakrishnan, “Towards Formal Approaches to System Resilience,” in PRDC, 2013.

[7] A. Thomas and K. Pattabiraman, “LLFI: An Intermediate Code Level Fault Injector For Soft Computing Applications,” 2013, informal proceedings.

[8] H. Schirmeier, M. Hoffmann, C. Dietrich, M. Lenz, D. Lohmann, and O. Spinczyk, “Fail*: An Open and Versatile Fault-Injection Framework for the Assessment of Software-Implemented Hardware Fault Tolerance,” in EDCC, Sep. 2015.

[9] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization (CGO), 2004, pp. 75–86.

[10] “The LLVM Compiler Infrastructure,” http://llvm.org/.

[11] V. C. Sharma, Z. Rakamaric, and G. Gopalakrishnan, “FUSED: A Low-cost online Soft-Error Detector,” in 10th SELSE, 2014.

[12] S. Chen, L. Peng, and G. Bronevetsky, “A Framework For Evaluating Comprehensive Fault Resilience Mechanisms In Numerical Programs,” J. Supercomputing, 2015.

[13] J. Calhoun, L. Olson, and M. Snir, “FlipIt: An LLVM Based Fault Injector for HPC,” in Euro-Par 2014, 2014, vol. LNCS 8805, pp. 547–558.

[14] “GCC, the GNU Compiler Collection,” https://gcc.gnu.org/.

[15] “Clang: a C language family frontend for LLVM,” http://clang.llvm.org/.

[16] “Intel SPMD Program Compiler,” https://ispc.github.io/index.html.

[17] “The open standard for parallel programming of heterogeneous systems,” https://www.khronos.org/opencl/.

[18] M. Pharr and W. Mark, “ispc: A SPMD compiler for high-performance CPU programming,” in Innovative Parallel Computing (InPar), 2012, May 2012, pp. 1–13.

[19] D. S. Khudia and S. A. Mahlke, “Low Cost Control Flow Protection Using Abstract Control Signatures,” in LCTES, 2013, pp. 3–12.

[20] Q. Lu, K. Pattabiraman, M. Gupta, and J. Rivers, “SDCTune: A Model for Predicting the SDC Proneness of an Application for Configurable Protection,” in CASES, Oct 2014, pp. 1–10.

[21] J. Cebrian, M. Jahre, and L. Natvig, “Optimized Hardware for Suboptimal Software: The Case for SIMD-aware Benchmarks,” in ISPASS, March 2014, pp. 66–75.

[22] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton University, January 2011.

[23] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” in 17th PACT, 2008, pp. 72–81.

[24] “Open Source Scientific Computing Library,” http://people.sc.fsu.edu/~jburkardt/.

[25] N. A. Weiss, Elementary Statistics, 8th ed. Pearson, 2011.

[26] S. K. S. Hari, T. Tsai, M. Stephenson, S. W. Keckler, and J. Emer, “SASSIFI: Evaluating Resilience of GPU Applications,” in 11th SELSE, 2015.

[27] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi, “GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications,” in ISPASS, 2014.
