COMMUNICATIONS OF THE ACM | FEBRUARY 2010 | VOL. 53 | NO. 2
contributed articles
DOI: 10.1145/1646353.1646374

A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World

How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.

BY AL BESSEY, KEN BLOCK, BEN CHELF, ANDY CHOU, BRYAN FULTON, SETH HALLEM, CHARLES HENRI-GROS, ASYA KAMSKY, SCOTT MCPEAK, AND DAWSON ENGLER

IN 2002, COVERITY commercialized(3) a research static bug-finding tool.(6,9) Not surprisingly, as academics, our view of commercial realities was not perfectly accurate. However, the problems we encountered were not the obvious ones. Discussions with tool researchers and system builders suggest we were not alone in our naïveté. Here, we document some of the more important examples of what we learned developing and commercializing an industrial-strength bug-finding tool.

We built our tool to find generic errors (such as memory corruption and data races) and system-specific or interface-specific violations (such as violations of function-ordering constraints). The tool, like all static bug finders, leveraged
the fact that programming rules often map clearly to source code; thus static inspection can find many of their violations. For example, to check the rule "acquired locks must be released," a checker would look for the relevant operations (such as lock() and unlock()) and inspect the code path, flagging rule disobedience (such as lock() with no unlock(), and double locking).

For those who keep track of such things, checkers in the research system typically traverse program paths (flow-sensitive) in a forward direction, going across function calls (inter-procedural) while keeping track of call-site-specific information (context-sensitive) and, toward the end of the effort, had some of the support needed to detect when a path was infeasible (path-sensitive).

A glance through the literature reveals many ways to go about static bug finding.(1,2,4,7,8,11) For us, the central religion was results: If it worked, it was good, and if not, not. The ideal: check millions of lines of code with little manual setup and find the maximum number of serious true errors with the minimum number of false reports. As much as possible, we avoided using annotations or specifications to reduce manual labor.
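The lock rule above is simple enough to sketch in code. The following toy checker (an illustrative sketch, not Coverity's implementation) walks one straight-line path of statements, tracking whether a lock is held, and flags double locking and paths that end with lock() but no unlock(); a real flow-sensitive checker would do this along every path of a control-flow graph.

```python
# Toy checker for the rule "acquired locks must be released".
# Illustrative sketch only: it handles a single straight-line path;
# a real checker traverses all paths of the control-flow graph.

def check_lock_rule(statements):
    """Return diagnostics for a list of statement strings."""
    errors = []
    held = False
    for lineno, stmt in enumerate(statements, start=1):
        # "unlock(" contains "lock(", so test for unlock first.
        if "unlock(" in stmt:
            if not held:
                errors.append(f"line {lineno}: unlock without lock")
            held = False
        elif "lock(" in stmt:
            if held:
                errors.append(f"line {lineno}: double lock")
            held = True
    if held:
        errors.append("end of path: lock() with no unlock()")
    return errors
```

Note that, in the spirit of unsoundness discussed next, a statement the checker does not recognize (say, inline assembly) is simply ignored rather than analyzed.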
Like the PREfix product,(2) we were also unsound. Our product did not verify the absence of errors but rather tried to find as many of them as possible. Unsoundness let us focus on handling the easiest cases first, scaling up as it proved useful. We could ignore code constructs that led to high rates of false-error messages (false positives) or analysis complexity, in the extreme skipping problematic code entirely (such as assembly statements, functions, or even entire files). Circa 2000, unsoundness was controversial in the research community, though it has since become almost a de facto tool bias for commercial products and many research projects.
Initially, publishing was the main force driving tool development. We would generally devise a set of checkers or analysis tricks, run them over a few million lines of code (typically Linux), count the bugs, and write everything up. Like other early static-tool researchers, we benefited from what seems an empirical law: Assuming you have a reasonable tool, if you run it over a large, previously unchecked system, you will always find bugs. If you don't, the immediate knee-jerk reaction is that something must be wrong. Misconfiguration? Mistake with macros? Wrong compilation target? If programmers must obey a rule hundreds of times, then without an automatic safety net they cannot avoid mistakes. Thus, even our initial effort with primitive analysis found hundreds of errors.

(Code Profiles illustrations by W. Bradford Paley, http://didi.com)

This is the research context. We now
describe the commercial context. Our rough view of the technical challenges of commercialization was that, given that the tool would regularly handle large amounts of real code, we needed only a pretty box; the rest was a business issue. This view was naïve. While we include many examples of unexpected obstacles here, they devolve mainly from consequences of two main dynamics:

First, in the research lab a few people check a few code bases; in reality many check many. The problems that show up when thousands of programmers use a tool to check hundreds (or even thousands) of code bases do not show up when you and your co-authors check only a few. The result of summing many independent random variables? A Gaussian distribution, most of it not on the points you saw and adapted to in the lab. Furthermore, Gaussian distributions have tails. As the number of samples grows, so, too, does the absolute number of points several standard deviations from the mean. The unusual starts to occur with increasing frequency.

W. Bradford Paley's CodeProfiles
was originally commissioned for the Whitney Museum of American Art's CODeDOC exhibition and later included in MoMA's Design and the Elastic Mind exhibition. CodeProfiles explores the space of code itself; the program reads its source into memory, traces three points as they once moved through that space, then prints itself on the page.
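The tail effect described above is easy to see numerically. In this hedged sketch (illustrative numbers, standard library only, not from the article), the fraction of draws beyond three standard deviations stays fixed at roughly 0.27%, so the absolute count of "unusual" points grows linearly with sample size:

```python
# Illustrative sketch: for a Gaussian, about 0.27% of samples fall beyond
# 3 standard deviations. The fraction is constant, so the absolute number
# of unusual samples grows linearly with the number of samples taken.
import random

def count_beyond_3_sigma(n_samples, seed=0):
    rng = random.Random(seed)
    return sum(1 for _ in range(n_samples)
               if abs(rng.gauss(0.0, 1.0)) > 3.0)
```

At a thousand samples the expected count is under three; at a million it is roughly 2,700. A weirdness rate of one code base in a thousand is invisible in the lab and a routine event across a large customer population.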
For code, these features include problematic idioms, the types of false positives encountered, the distance of a dialect from a language standard, and the way the build works. For developers, variations appear in raw ability, knowledge, the amount they care about bugs, false positives, and the types of both. A given company won't deviate in all these features but, given the number of features to choose from, often includes at least one weird oddity. Weird is not good. Tools want expected. Expected you can tune a tool to handle; surprise interacts badly with
tuning assumptions.

Second, in the lab the user's values, knowledge, and incentives are those of the tool builder, since the user and the builder are the same person. Deployment leads to severe erosion; users often have little understanding of the tool and little interest in helping develop it (for reasons ranging from simple skepticism to perverse reward incentives) and typically label any error message they find confusing as false. A tool that works well under these constraints looks very different from one tool builders design for themselves.

However, for every user who lacks the understanding or motivation one might hope for, another is eager to understand how it all works (or perhaps already does), willing to help even beyond what one might consider reasonable. Such champions make sales as easily as their antithesis blocks them. However, since their main requirements tend to be technical (the tool must work), the reader likely sees how to make them happy, so we rarely discuss them here.

Most of our lessons come from two different styles of use: the initial trial of the tool and how the company uses the tool after buying it. The trial is a pre-sale demonstration that attempts to show that the tool works well on a potential customer's code. We generally ship a salesperson and an engineer to the customer's site. The engineer configures the tool, runs it over a given code base, and presents results soon after. Initially, the checking run would happen in the morning, and the results meeting would follow in the afternoon; as code size at trials grows, it's not uncommon to split them across two (or more) days.

Sending people to a trial dramatically raises the incremental cost of each sale. However, it gives the non-trivial benefit of letting us educate customers (so they do not label serious, true bugs as false positives) and do real-time, ad hoc workarounds of weird customer system setups.

The trial structure is a harsh test for any tool, and there is little time. The checked system is large (millions of lines of code, with 20-30 MLOC a possibility). The code and its build system are both difficult to understand. However, the tool must routinely go from never seeing the system previously to getting good bugs in a few hours. Since we present results almost immediately after the checking run, the bugs must be good, with few false positives; there is no time to cherry-pick them. Furthermore, the error messages must be clear enough that the sales engineer (who didn't build the checked system or the tool) can diagnose and explain them in real time in response to "What about this one?" questions.

The most common usage model for the product has companies run it as part of their nightly build. Thus, most require that checking runs complete in 12 hours, though those with larger code bases (10+ MLOC) grudgingly accept 24 hours. A tool that cannot analyze at least 1,400 lines of code per minute makes it difficult to meet these targets. During a checking run, error messages are put in a database for subsequent triaging, where users label them as true errors or false positives. We spend significant effort designing the system so these labels are automatically reapplied if the error message they refer to comes up on subsequent runs, despite code-dilating edits or analysis-changing bug-fixes to checkers.

As of this writing (December 2009), approximately 700 customers have licensed the Coverity Static Analysis product, with somewhat more than a billion lines of code among them. We estimate that since its creation the tool has analyzed several billion lines of code, some more difficult than others.

Caveats. Drawing lessons from a single data point has obvious problems. Our product's requirements roughly form a least common denominator set needed by any tool that uses non-trivial analysis to check large amounts of code across many organizations; the tool must find and parse the code, and users must be able to understand the error messages. Further, there are many ways to handle the problems we have encountered, and our way may not be the best one. We discuss our methods more for specificity than as a claim of solution. Finally, while we have had success as a static-tools company, these are small steps. We are tiny compared to mature technology companies. Here, too, we have tried to limit our discussion to conditions likely to be true in a larger setting.

Laws of Bug Finding

The fundamental law of bug finding is: No Check = No Bug. If the tool can't check a system, file, code path, or given property, then it won't find bugs in it. Assuming a reasonable tool, the first-order bound on bug counts is just how much code can be shoved through the tool. Ten times more code is 10 times more bugs. We imagined this law was as simple a statement of fact as we needed. Unfortunately, two seemingly vacuous corollaries place harsh first-order bounds on bug counts.

Law: You can't check code you don't see. It seems too trite to note that checking code requires first finding it... until you try to do so consistently on many large code bases. Probably the most reliable way to check a system is to grab its code during the build process; the build system knows exactly which files are included in the system and how to compile them. This seems like a simple task. Unfortunately, it's often difficult to understand what an ad hoc, homegrown build system is doing well enough to extract this information, a difficulty compounded by the near-universal absolute edict: "No, you can't touch that." By default, companies refuse to let an external force modify anything; you cannot modify their compiler path, their broken makefiles (if they have any), or in any way write or reconfigure anything other than your own temporary files. Which is fine, since if you need to modify it, you most likely won't understand it.

Further, for isolation, companies often insist on setting up a test machine for you to use. As a result, not infrequently the build you are given to check does not work in the first place, which you would get blamed for if you had touched anything.

Our approach in the initial months of commercialization in 2002 was a low-tech, read-only replay of the build commands: run make, record its output in a file, and rewrite the invocations of their compiler (such as gcc) to instead call our checking tool, then rerun everything. Easy and simple. This approach worked perfectly in the lab and for a small number of our earliest customers. We then had the following conversation with a potential customer:

"How do we run your tool?"
"Just type 'make' and we'll rewrite its output."
"What's 'make'? We use ClearCase."
"Uh, what's ClearCase?"

This turned out to be a chasm we
couldn't cross. (Strictly speaking, the customer used ClearMake, but the superficial similarities in name are entirely unhelpful at the technical level.) We skipped that company and went to a few others. They exposed other problems with our method, which we papered over with 90% hacks. None seemed so troublesome as to force us to rethink the approach, at least until we got the following support call from a large customer: "Why is it when I run your tool, I have to reinstall my Linux distribution from CD?"

This was indeed a puzzling question. Some poking around exposed the following chain of events: the company's make used a novel format to print out the absolute path of the directory in which the compiler ran; our script misparsed this path, producing the empty string that we gave as the destination to the Unix cd (change directory) command, causing it to change to the top level of the system; it ran rm -rf * (recursive delete) during compilation to clean up temporary files; and the build process ran as root. Summing these points produces the removal of all files on the
system.

The right approach, which we have used for the past seven years, kicks off the build process and intercepts every system call it invokes. As a result, we can see everything needed for checking, including the exact executables invoked, their command lines, the directory they run in, and the version of the compiler (needed for compiler-bug workarounds). This control makes it easy to grab and precisely check all source code, to the extent of automatically changing the language dialect on a per-file basis. To invoke our tool, users need only call it with their build command as an argument:

  cov-build

We thought this approach was bullet-proof.
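We do not describe cov-build's internals here, but the interposition idea can be sketched. The hypothetical code below (names and trace format are our assumptions, modeled on `strace -f -e trace=execve` output, not Coverity's implementation) filters a process-spawn trace of a build down to the compiler invocations a checking tool would need to replay:

```python
# Hypothetical sketch of build interposition: given one execve record per
# line from a traced build, recover every compiler invocation (executable
# plus argv) so a checking tool can replay it. The trace format is an
# assumption modeled on `strace -f -e trace=execve` output.
import re

COMPILERS = {"gcc", "g++", "cc", "clang"}

# e.g.: 1234 execve("/usr/bin/gcc", ["gcc", "-c", "foo.c"], 0x7ffd...) = 0
EXECVE_RE = re.compile(r'execve\("([^"]+)",\s*\[([^\]]*)\]')

def compiler_invocations(trace_lines):
    """Return (executable, argv) pairs for compilers spawned by the build."""
    calls = []
    for line in trace_lines:
        m = EXECVE_RE.search(line)
        if not m:
            continue
        exe = m.group(1)
        if exe.split("/")[-1] not in COMPILERS:
            continue  # ignore rm, mkdir, shells, ...
        argv = [a.strip().strip('"') for a in m.group(2).split(",") if a.strip()]
        calls.append((exe, argv))
    return calls
```

A real interposer also records the working directory and compiler version, which, as noted above, are needed for per-file dialect selection and compiler-bug workarounds.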
Unfortunately, as the astute reader has noted, it requires a command prompt. Soon after implementing it we went to a large company, so large it had a hyperspecialized build engineer, who engaged in the following dialogue:

"How do I run your tool?"
"Oh, it's easy. Just type cov-build before your build command."
"Build command? I just push this [GUI] button..."

Social vs. technical. The social
restriction that you cannot change anything, no matter how broken it may be, forces ugly workarounds. A representative example: Build interposition on Windows requires running the compiler in the debugger. Unfortunately, doing so causes a very popular Windows C++ compiler (Visual Studio C++ .NET 2003) to prematurely exit with a bizarre error message. After some high-stress fussing, it turns out that the compiler has a use-after-free bug, hit when code used a Microsoft-specific C language extension (certain invocations of its #using directive). The compiler runs fine in normal use; when it reads the freed memory, the original contents are still there, so everything works. However, when run with the debugger, the compiler switches to using a debug malloc, which on each free call sets the freed memory contents to a garbage value. The subsequent read returns this value, and the compiler blows up with a fatal error. The sufficiently perverse reader can no doubt guess the solution.(a)

Law: You can't check code you can't
parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

The C language does not exist; neither does Java, C++, or C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as "C code" no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C code, a tool's C front-end will reject it. This problem is the tool's problem.

(a) Immediately after process startup, our tool writes 0 to the memory location of the in-debugger variable that the compiler checks to decide whether to use the debug malloc.

Compounding it (and others), the person responsible for running the
tool is often not the one punished if the checked code breaks. (This person also often doesn't understand the checked code or how the tool works.) In particular, since our tool often runs as part of the nightly build, the build engineer managing this process is often in charge of ensuring the tool runs correctly. Many build engineers have a single concrete metric of success: that all tools terminate with successful exit codes. They see Coverity's tool as just another speed bump in the list of things they must get through. Guess how receptive they are to fixing code the official compiler accepted but the tool rejected with a parse error? This lack of interest generally extends to any aspect of the tool for which
they are responsible.

Many (all?) compilers diverge from the standard. Compilers have bugs. Or are very old. Written by people who misunderstand the specification (not just for C++). Or have numerous extensions. The mere presence of these divergences causes the code they allow to appear. If a compiler accepts construct X, then given enough programmers and code, eventually X is typed, not rejected, then encased in the code base, where the static tool will, not helpfully, flag it as a parse error.

The tool can't simply ignore divergent code, since significant markets are awash in it. For example, one enormous software company once viewed conformance as a competitive disadvantage, since it would let others make tools usable in lieu of its own. Embedded software companies make great tool customers, given the bug aversion of their customers; users don't like it if their cars (or even their toasters) crash. Unfortunately, the space constraints in such systems and their tight coupling to hardware have led to an astonishing oeuvre of enthusiastically used compiler extensions.

Finally, in safety-critical software systems, changing the compiler often requires costly re-certification. Thus, we routinely see the use of decades-old compilers. While the languages these compilers accept have interesting features, strong concordance with a modern language standard is not one of them. Age begets new problems. Realistically, diagnosing a compiler's divergences requires having a copy of the compiler. How do you purchase a license for a compiler 20 versions old? Or whose company has gone out of business? Not through normal channels. We have literally resorted to buying copies off eBay.

This dynamic shows up in a softer way with non-safety-critical systems; the larger the code base, the more the sales force is rewarded for a sale, skewing sales toward such systems. Large code bases take a while to build and often get tied to the compiler used when they were born, skewing the average age of the compilers whose languages we must accept.

If divergence-induced parse errors are isolated events scattered here and there, then they don't matter. An unsound tool can skip them. Unfortunately, failure often isn't modular. In a sad, too-common story line, some crucial, purportedly C header file contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer's source files, rejecting each in turn. The customer's derisive stance is, "Deep source code analysis? Your tool can't even compile code. How can it find bugs?" It may find this event so amusing that it tells many friends.

Tiny set of bad snippets seen in header files. One of the first examples we encountered of illegal-construct-in-key-header-file came up at a large networking company:

  // redefinition of parameter a
  void foo(int a, int a);

The programmer names foo's first formal parameter a and, in a form of lexical locality, the second as well. Harmless. But any conformant compiler will reject this code. Our tool certainly did. This is not helpful; compiling no files means finding no bugs, and people don't need your tool for that. And, because its compiler accepted it, the potential customer blamed us.

Here's an opposite, less-harmless case where the programmer is trying to make two different things the same:

  typedef char int;

(Useless type name in empty declaration.)

And one where readability trumps the language spec:

  unsigned x = 0xdead_beef;

(Invalid suffix '_beef' on integer constant.)

From the embedded space, creating a label that takes no space:

  void x;

(Storage size of 'x' is not known.)

Another embedded example that controls where the space comes from:

  unsigned x @ text;

(Stray '@' in program.)

A more advanced case of a nonstandard construct is:

  Int16 ErrSetJump(ErrJumpBuf buf) = { 0x4E40 + 15, 0xA085; }

It treats the hexadecimal values of machine-code instructions as program source.

The award for most widely used extension should, perhaps, go to Microsoft support for precompiled headers. Among the most nettlesome troubles is that the compiler skips all the text before an inclusion of a precompiled header. The implication of this behavior is that the following code can be compiled without complaint:

  I can put whatever I want here.
  It doesn't have to compile.
  If your compiler gives an error, it sucks.
  #include

Microsoft's on-the-fly header fabrication makes things worse.

Assembly is the most consistently troublesome construct. It's already non-portable, so compilers seem to almost deliberately use weird syntax, making it difficult to handle in a general way. Unfortunately, if a programmer uses assembly, it's probably to write a widely used function, and if the programmer does it, the most likely place to put it is in a widely used header file. Here are two ways (out of many) to issue a mov instruction:

  // First way
  foo() { __asm mov eax, eab mov eax, eab; }

  // Second way
  #pragma asm
  __asm [ mov eax, eab mov eax, eab ]
  #pragma end_asm

The only thing shared in addition to mov is the lack of common textual keys that can be used to elide
them.

We have thus far discussed only C, a simple language; C++ compilers diverge to an even worse degree, and we go to great lengths to support them. On the other hand, C# and Java have been easier, since we analyze the bytecode they compile to rather than their source.

How to parse not-C with a C front-end. OK, so programmers use extensions. How difficult is it to solve this problem? Coverity has a full-time team of some of its sharpest engineers to refight this banal, technically uninteresting problem as their sole job. They're never done.(b) We first tried to make the problem
someone else's problem by using the Edison Design Group (EDG) C/C++ front-end to parse code.(5) EDG has worked on how to parse real C code since 1989 and is the de facto industry-standard front-end. Anyone deciding not to build a homegrown front-end will almost certainly license from EDG. All those who do build a homegrown front-end will almost certainly wish they had licensed EDG after a few experiences with real code. EDG aims not just for mere feature compatibility but for version-specific bug compatibility across a range of compilers. Its front-end probably resides near the limit of what a profitable company can do in terms of front-end gyrations. Unfortunately, the creativity of compiler writers means that despite two decades of work EDG still regularly meets defeat when trying to parse real-world large code bases.(c)

(b) Anecdotally, the dynamic memory-checking tool Purify(10) had an analogous struggle at the machine-code level, where Purify's developers expended significant resources reverse-engineering the various activation-record layouts used by different compilers.

Thus, our next step is:
for each supported compiler, we write a set of transformers that mangle its personal language into something closer to what EDG can parse. The most common transformation simply rips out the offending construct. As one measure of how much C does not exist, the table here counts the lines of transformer code needed to make the languages accepted by 18 widely used compilers look vaguely like C. A line of transformer code was almost always written only when we were burned to a degree that was difficult to work around. Adding each new compiler to our list of supported compilers almost always requires writing some kind of transformer. Unfortunately, we sometimes need a deeper view of semantics, so are forced to hack EDG directly. This method is a last resort. Still, at last count (as of early 2009) there were more than 406(!) places in the front-end where we had an #ifdef COVERITY to handle a specific, unanticipated
construct.

EDG is widely used as a compiler front-end. One might think that for customers using EDG-based compilers we would be in great shape. Unfortunately, this is not necessarily the case. Even ignoring the fact that compilers based on EDG often modify EDG in idiosyncratic ways, there is no single EDG front-end but rather many versions and possible configurations that often accept a slightly different language variant than the (often newer) version we use. As a Sisyphean twist, assume we cannot work around an incompatibility and report it. If EDG then considers the problem important enough to fix, it will roll it together with other patches into a new version. So, to get our own fix, we must upgrade the version we use, often causing divergence from other unupgraded EDG compiler front-ends, and more issues ensue.

(c) Coverity won the dubious honor of being the single largest source of EDG bug reports after only three years of use.

Social versus technical. Can we get customer source code? Almost always, no.
Despite nondisclosure agreements, the answer is still no, even for parse errors and preprocessed code, perhaps because we are viewed as too small to sue to recoup damages. As a result, our sales engineers must type problems in reports from memory. This works as well as you might expect. It's worse for performance problems, which often show up only in large-code settings. But one shouldn't complain, since classified systems make things even worse. Can we send someone on-site to look at the code? No. You listen to recited syntax on the phone.
Bugs

Do bugs matter? Companies buy bug-finding tools because they see bugs as bad. However, not everyone agrees that bugs matter. The following event has occurred during numerous trials. The tool finds a clear, ugly error (memory corruption or use-after-free) in important code, and the interaction with the customer goes like thus:

"So?"
"Isn't that bad? What happens if you hit it?"
"Oh, it'll crash. We'll get a call." [Shrug.]

If developers don't feel pain, they often don't care. Indifference can arise from lack of accountability; if QA cannot reproduce a bug, then there is no blame. Other times, it's just odd:

"Is this a bug? I'm just the security guy."
"That's not a bug; it's in third-party code."
"A leak? Don't know. The author left years ago..."
"No, your tool is broken; that is not
a bug."

Lines of code per transformer for the 18 common compilers we support:

   160  QNX
   280  HP-UX
   285  picc.cpp
   294  sun.java.cpp
   384  st.cpp
   334  cosmic.cpp
   421  intel.cpp
   457  sun.cpp
   603  iccmsa.cpp
   629  bcc.cpp
   673  diab.cpp
   756  xlc.cpp
   912  ARM
   914  GNU
  1294  Microsoft
  1425  keil.cpp
  1848  cw.cpp
  1665  Metrowerks

Given enough code, any bug-finding tool will uncover some weird
examples. Given enough coders, you'll see the same thing. The following utterances were culled from trial meetings.

Upon seeing an error report saying the following loop body was dead code:

  for (i = 1; i < 0; i++)
      ... dead code ...

"No, that's a false positive; a loop executes at least once."

For this memory corruption error (32-bit machine):

  int a[2], b;
  memset(a, 0, 12);

"No, I meant to do that; they are next to each other."

For this use-after-free:

  free(foo);
  foo->bar = ...;

"No, that's OK; there is no malloc call between the free and the use."

As a final example, a buffer overflow checker flagged a bunch of errors of the form:

  unsigned p[4];
  ...
  p[4] = 1;

"No, ANSI lets you write 1 past the end of the array."

After heated argument, the programmer said, "We'll have to agree to disagree." We could agree about the disagreement, though we couldn't quite comprehend it. The (subtle?) interplay between 0-based offsets and buffer sizes seems to come up every few months.
While programmers are not often so egregiously mistaken, the general trend holds; a not-understood bug report is commonly labeled a false positive, rather than spurring the programmer to delve deeper. The result? We have completely abandoned some analyses that might generate difficult-to-understand reports.

How to handle cluelessness. You cannot often argue with people who are sufficiently confused about technical matters; they think you are the one who doesn't get it. They also tend to get emotional. Arguing reliably kills sales. What to do? One trick is to try to organize a large meeting so their peers do the work for you. The more people in the room, the more likely there is someone very smart and respected who cares (about bugs and about the given code), can diagnose an error (to counter arguments that it's a false positive), has been burned by a similar error, loses his/her bonus for errors, or is in another group (another potential sale).

Further, a larger results meeting
increases the probability that anyone laid off at a later date attended it and saw how your tool worked. True story: A networking company agreed to buy the Coverity product, and one week later laid off 110 people (not because of us). Good or bad? For the fired people it clearly wasn't a happy day. However, it had a surprising result for us at a business level; when these people were hired at other companies, some suggested bringing the tool in for a trial, resulting in four sales.

What happens when you can't fix
all the bugs? If you think bugs are bad enough to buy a bug-finding tool, you will fix them. Not quite. A rough heuristic: with fewer than 1,000 bugs, companies fix them. More? The baseline is to record the current bugs and not fix them, but do fix any new bugs. Many companies have independently come up with this practice, which is more rational than it seems. Having a lot of bugs usually requires a lot of code. Much of it won't have changed in a long time. A reasonable, conservative heuristic is that if you haven't touched code in years, don't modify it (even for a bug fix) to avoid causing any breakage.

A surprising consequence is that it's not uncommon for tool improvement to be viewed as bad or at least a problem. Pretend you are a manager. For anything bad you can measure, you want it to diminish over time. This means you are improving something and get a bonus.
You may not understand technical issues that well, and your boss certainly doesn't understand them. Thus, you want a simple graph that looks like Figure 1; no manager gets a bonus for Figure 2. Representative story: At company X, version 2.4 of the tool found approximately 2,400 errors, and over time the company fixed about 1,200 of them. Then it upgraded to version 3.6. Suddenly there were 3,600 errors. The manager was furious for two reasons: One, we undid all the work his people had done, and two, how could we have missed them the first time?
How do upgrades happen when more bugs is no good? Companies independently settle on a small number of upgrade models:

- Never. Guarantees improvement;
- Never before a release (where it would be most crucial). Counterintuitively, this happens most often in companies that believe the tool helps with release quality, in that they use it to gate the release;
- Never before a meeting. This is at least socially rational;
- Upgrade, then roll back. Seems to happen at least once at large companies; and
- Upgrade only checkers where they fix most errors. Common checkers include use-after-free, memory corruption, (sometimes) locking, and (sometimes) checkers that flag code contradictions.

Do missed errors matter? If people
don't fix all the bugs, do missed errors (false negatives) matter? Of course not; they are invisible. Well, not always. Common cases: Potential customers intentionally introduced bugs into the system, asking "Why didn't you find it?" Many check if you find important past bugs. The easiest sale is to a group whose code you are checking that was horribly burned by a specific bug last week, and you find it. If you don't find it? No matter the hundreds of other bugs that may be the next important bug. Here is an open secret known to bug finders: The set of bugs found by tool A is rarely a superset of another tool B's, even if A is much better than B. Thus, the discussion gets pushed from "A is better than B" to "A finds some things, B finds some things," and that does not help the case of A.
Adding bugs can be a problem; losing already-inspected bugs is always a problem, even if you replace them with many more new errors. While users know in theory that the tool is not a verifier, it's very different when the tool demonstrates this limitation, good and hard, by losing a few hundred known errors after an upgrade. The easiest way to lose bugs is to add just one to your tool. A bug that causes false negatives is easy to miss. One such bug in how our early research tool's internal representation handled array references meant the analysis ignored most array uses for more than nine months. In our commercial product, blatant situations like this are prevented through detailed unit testing, but uncovering the effect of subtle bugs is still difficult
because customer source code is complex and not available.

Churn
Users really want the same result from run to run. Even if they changed their code base. Even if they upgraded the tool. Their model of error messages? Compiler warnings. Classic determinism states: same input + same function = same result. What users want: different input (modified code base) + different function (tool version) = same result. As a result, we find upgrades to be a constant headache. Analysis changes can easily cause the set of defects found to shift. The new-speak term we use internally is "churn." A big change from academia is that we spend considerable time and energy worrying about churn when modifying checkers. We try to cap churn at less than 5% per release. This goal means large classes of analysis tricks are disallowed since they cannot obviously guarantee minimal effect on the bugs found.
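A cap like 5% implies churn must be measured run over run. As an illustrative sketch only (our invention, not Coverity's actual scheme; the checker names and the keying are assumptions), churn between two runs can be computed by keying each report on (checker, file, line) and counting reports lost and gained:

```c
#include <string.h>

/* A defect report keyed the way a tool might match results across runs. */
struct report { const char *checker; const char *file; int line; };

static int same(const struct report *a, const struct report *b) {
    return a->line == b->line &&
           strcmp(a->checker, b->checker) == 0 &&
           strcmp(a->file, b->file) == 0;
}

static int contains(const struct report *set, int n, const struct report *r) {
    for (int i = 0; i < n; i++)
        if (same(&set[i], r)) return 1;
    return 0;
}

/* churn% = (reports lost + reports gained) / old report count * 100 */
double churn_percent(const struct report *oldr, int n_old,
                     const struct report *newr, int n_new) {
    int lost = 0, gained = 0;
    for (int i = 0; i < n_old; i++)
        if (!contains(newr, n_new, &oldr[i])) lost++;
    for (int i = 0; i < n_new; i++)
        if (!contains(oldr, n_old, &newr[i])) gained++;
    return 100.0 * (lost + gained) / n_old;
}

/* Two sample runs: one old report disappears, one new one appears. */
const struct report run1[] = {
    {"USE_AFTER_FREE", "a.c", 10}, {"LOCK", "a.c", 40},
    {"NULL_DEREF", "b.c", 7},      {"LOCK", "c.c", 99},
};
const struct report run2[] = {
    {"USE_AFTER_FREE", "a.c", 10}, {"LOCK", "a.c", 40},
    {"NULL_DEREF", "b.c", 7},      {"NULL_DEREF", "d.c", 3},
};
```

With these samples, churn_percent(run1, 4, run2, 4) is 50.0: one report lost plus one gained out of four old reports, which would blow far past a 5% budget.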
Randomization is verboten, a tragedy given that it provides simple, elegant solutions to many of the exponential problems we encounter. Timeouts are also bad and sometimes used as a last resort but never encouraged.

Myth: More analysis is always good. While nondeterministic analysis might cause problems, it seems that adding more deterministic analysis is always good. Bring on path sensitivity! Theorem proving! SAT solvers! Unfortunately, no. At the most basic level, errors found with little analysis are often better than errors found with deeper tricks. A good error is probable, a true error, easy to diagnose; best is difficult to misdiagnose. As the number of analysis steps increases, so, too, does the chance of analysis mistake, user confusion, or the perceived improbability of event sequence. No analysis equals no mistake.
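The difference shows up even in tiny examples. In this illustrative C sketch (our code, not the article's), the first defect needs almost no analysis to confirm, while the second needs interprocedural, path-sensitive reasoning, exactly the extra steps that invite mistakes and user doubt:

```c
#include <stddef.h>

/* Shallow: the dereference directly contradicts the preceding NULL
   check; a user can confirm the report at a glance. */
int shallow(int *p) {
    if (p == NULL)
        return *p;        /* flagged: dereference on the p == NULL path */
    return *p + 1;
}

/* Helper: returns NULL exactly when n is negative. */
static int *pick(int *a, int *b, int n) {
    return n < 0 ? NULL : (n % 2 ? a : b);
}

/* Deep: the bad dereference is real, but seeing it requires tracking
   pick()'s return value across the call and correlating it with the
   n < 0 path, so the report is easier to misdiagnose and dismiss. */
int deep(int *a, int *b, int n) {
    int *q = pick(a, b, n);
    if (n < 0)
        return *q;        /* flagged only with interprocedural path tracking */
    return *q + 1;
}
```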
Further, explaining errors is often more difficult than finding them. A misunderstood explanation means the error is ignored or, worse, transmuted into a false positive. The heuristic we follow: Whenever a checker calls a complicated analysis subroutine, we have to explain what that routine did to the user, and the user will then have to (correctly) manually replicate that tricky thing in his/her head.

Sophisticated analysis is not easy to explain or redo manually. Compounding the problem, users often lack a strong grasp on how compilers work. A representative user quote is "Static analysis? What's the performance overhead?"
The end result? Since the analysis that suppresses false positives is invisible (it removes error messages rather than generates them), its sophistication has scaled far beyond what our research system did.

[Figure 1. Bugs down over time = manager bonus.]
[Figure 2. No bonus.]

On the other hand, the commercial Coverity product, despite its improvements, lags behind the research system in some ways because it had to drop checkers or techniques that demand too much sophistication on the part of the user. As an example, for many years we gave up on checkers that flagged concurrency errors; while finding such errors was not too difficult, explaining them to many users was. (The PREfix system also avoided reporting races for similar reasons though is now supported by
Coverity.)

No bug is too foolish to check for. Given enough code, developers will write almost anything you can think of. Further, completely foolish errors can be some of the most serious; it's difficult to be extravagantly nonsensical in a harmless way. We've found many errors over the years. One of the absolute best was the following in the X Window System:

if (getuid() != 0 && geteuid == 0) {
    ErrorF("only root");
    exit(1);
}

It allowed any local user to get root access^d and generated enormous press coverage, including a mention on Fox News (the Web site). The checker was written by Scott McPeak as a quick hack to get himself familiar with the system. It made it into the product not because of a perceived need but because there was no reason not to put it in. Fortunately.

False Positives
False
positives do matter. In our experience, more than 30% easily cause problems. People ignore the tool. True bugs get lost in the false. A vicious cycle starts where low trust causes complex bugs to be labeled false positives, leading to yet lower trust. We have seen this cycle triggered even for true errors. If people don't understand an error, they label it false. And done once, induction makes the (n+1)th time easier. We initially thought false positives could be eliminated through technology. Because of this dynamic we no longer think so.

We've spent considerable technical effort to achieve low false-positive rates in our static analysis product. We aim for below 20% for "stable" checkers. When forced to choose between more bugs or fewer false positives we typically choose the latter.

Talking about false positive rate is simplistic since false positives are not all equal. The initial reports matter inordinately; if the first N reports are false positives (N = 3?), people tend to utter variants on "This tool sucks." Furthermore, you never want an embarrassing false positive. A stupid false positive implies the tool is stupid. ("It's not even smart enough to figure that out?") This technical mistake can cause social problems. An expensive tool needs someone with power within a company or organization to champion it. Such people often have at least one enemy. You don't want to provide ammunition that would embarrass the tool champion internally; a false positive that fits in a punchline is really bad.

d The tautological check geteuid == 0 was intended to be geteuid() == 0. In its current form, it compares the address of geteuid to 0; given that the function exists, its address is never 0.

Conclusion
While we've focused on some of the less-pleasant experiences in the commercialization of bug-finding products, two positive experiences trump them all. First, selling a static tool has become dramatically easier in recent years. There has been a seismic shift in terms of the average programmer "getting it." When you say you have a static bug-finding tool, the response is no longer "Huh?" or "Lint? Yuck." This shift seems due to static bug finders being in wider use, giving rise to nice networking effects. The person you talk to likely knows someone using such a tool, has a competitor that uses it, or has been in a company that used it.

Moreover, while seemingly vacuous tautologies have had a negative effect on technical development, a nice balancing empirical tautology holds that bug finding is worthwhile for anyone with an effective tool. If you can find code, and the checked system is big enough, and you can compile (enough of) it, then you will always find serious errors. This appears to be a law. We encourage readers to exploit it.

Acknowledgments
We thank Paul Twohey, Cristian Cadar, and especially Philip Guo for their helpful, last-minute proofreading. The experience covered here was the work of many. We thank all who helped build the tool and company to its current state, especially the sales engineers, support engineers, and services engineers who took the product into complex environments and were often the first to bear the brunt of problems. Without them there would be no company to document. We especially thank all the customers who tolerated the tool during its transition from research quality to production quality and the numerous champions whose insightful feedback helped us focus on what mattered.

References
1. Ball, T. and Rajamani, S.K. Automatically validating temporal safety properties of interfaces. In Proceedings of the Eighth International SPIN Workshop on Model Checking of Software (Toronto, Ontario, Canada). M. Dwyer, Ed. Springer-Verlag, New York, 2001, 103–122.
2. Bush, W., Pincus, J., and Sielaff, D. A static analyzer for finding dynamic programming errors. Software: Practice and Experience 30, 7 (June 2000), 775–802.
3. Coverity static analysis; http://www.coverity.com
4. Das, M., Lerner, S., and Seigle, M. ESP: Path-sensitive program verification in polynomial time. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 57–68.
5. Edison Design Group. EDG C compiler front-end; http://www.edg.com
6. Engler, D., Chelf, B., Chou, A., and Hallem, S. Checking system rules using system-specific, programmer-written compiler extensions. In Proceedings of the Fourth Conference on Operating System Design & Implementation (San Diego, Oct. 22–25). USENIX Association, Berkeley, CA, 2000, 1–1.
7. Flanagan, C., Leino, K.M., Lillibridge, M., Nelson, G., Saxe, J.B., and Stata, R. Extended static checking for Java. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 234–245.
8. Foster, J.S., Terauchi, T., and Aiken, A. Flow-sensitive type qualifiers. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 1–12.
9. Hallem, S., Chelf, B., Xie, Y., and Engler, D. A system and language for building system-specific, static analyses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 69–82.
10. Hastings, R. and Joyce, B. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter 1992 USENIX Conference (Berkeley, CA, Jan. 20–24). USENIX Association, Berkeley, CA, 1992, 125–138.
11. Xie, Y. and Aiken, A. Context- and path-sensitive memory leak detection. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (Lisbon, Portugal, Sept. 5–9). ACM Press, New York, 2005, 115–125.

Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, and Scott McPeak are current or former employees of Coverity, Inc., a software company based in San Francisco, CA; http://www.coverity.com

Dawson Engler ([email protected]) is an associate professor in the Department of Computer Science and Electrical Engineering at Stanford University, Stanford, CA, and technical advisor to Coverity, Inc., San Francisco, CA.

© 2010 ACM 0001-0782/10/0200 $10.00