COMMUNICATIONS OF THE ACM | FEBRUARY 2010 | VOL. 53 | NO. 2
contributed articles
DOI: 10.1145/1646353.1646374

A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World

How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.

BY AL BESSEY, KEN BLOCK, BEN CHELF, ANDY CHOU, BRYAN FULTON, SETH HALLEM, CHARLES HENRI-GROS, ASYA KAMSKY, SCOTT MCPEAK, AND DAWSON ENGLER

IN 2002, COVERITY commercialized(3) a research static bug-finding tool.(6,9) Not surprisingly, as academics, our view of commercial realities was not perfectly accurate. However, the problems we encountered were not the obvious ones. Discussions with tool researchers and system builders suggest we were not alone in our naïveté. Here, we document some of the more important examples of what we learned developing and commercializing an industrial-strength bug-finding tool.

We built our tool to find generic errors (such as memory corruption and data races) and system-specific or interface-specific violations (such as violations of function-ordering constraints). The tool, like all static bug finders, leveraged
the fact that programming rules often map clearly to source code; thus static inspection can find many of their violations. For example, to check the rule "acquired locks must be released," a checker would look for the relevant operations (such as lock() and unlock()) and inspect the code path, flagging rule disobedience (such as lock() with no unlock(), and double locking).

For those who keep track of such things, checkers in the research system typically traverse program paths (flow-sensitive) in a forward direction, going across function calls (inter-procedural) while keeping track of call-site-specific information (context-sensitive) and, toward the end of the effort, had some of the support needed to detect when a path was infeasible (path-sensitive).

A glance through the literature reveals many ways to go about static bug finding.(1,2,4,7,8,11) For us, the central religion was results: If it worked, it was good, and if not, not. The ideal: check millions of lines of code with little manual setup and find the maximum number of serious true errors with the minimum number of false reports. As much as possible, we avoided using annotations or specifications to reduce manual labor.
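The lock rule above is simple enough to sketch in code. The following toy checker (an illustrative sketch, not Coverity's implementation) walks one straight-line path of statements, tracking whether a lock is held, and flags double locking and paths that end with lock() but no unlock(); a real flow-sensitive checker would do this along every path of a control-flow graph.

```python
# Toy checker for the rule "acquired locks must be released".
# Illustrative sketch only: it handles a single straight-line path;
# a real checker traverses all paths of the control-flow graph.

def check_lock_rule(statements):
    """Return diagnostics for a list of statement strings."""
    errors = []
    held = False
    for lineno, stmt in enumerate(statements, start=1):
        # "unlock(" contains "lock(", so test for unlock first.
        if "unlock(" in stmt:
            if not held:
                errors.append(f"line {lineno}: unlock without lock")
            held = False
        elif "lock(" in stmt:
            if held:
                errors.append(f"line {lineno}: double lock")
            held = True
    if held:
        errors.append("end of path: lock() with no unlock()")
    return errors
```

Note that, in the spirit of unsoundness discussed next, a statement the checker does not recognize (say, inline assembly) is simply ignored rather than analyzed.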
Like the PREfix product,(2) we were also unsound. Our product did not verify the absence of errors but rather tried to find as many of them as possible. Unsoundness let us focus on handling the easiest cases first, scaling up as it proved useful. We could ignore code constructs that led to high rates of false-error messages (false positives) or analysis complexity, in the extreme skipping problematic code entirely (such as assembly statements, functions, or even entire files). Circa 2000, unsoundness was controversial in the research community, though it has since become almost a de facto tool bias for commercial products and many research projects.
Initially, publishing was the main force driving tool development. We would generally devise a set of checkers or analysis tricks, run them over a few million lines of code (typically Linux), count the bugs, and write everything up. Like other early static-tool researchers, we benefited from what seems an empirical law: Assuming you have a reasonable tool, if you run it over a large, previously unchecked system, you will always find bugs. If you don't, the immediate knee-jerk reaction is that something must be wrong. Misconfiguration? Mistake with macros? Wrong compilation target? If programmers must obey a rule hundreds of times, then without an automatic safety net they cannot avoid mistakes. Thus, even our initial effort with primitive analysis found hundreds of errors.

(Code Profiles illustrations by W. Bradford Paley, http://didi.com)

This is the research context. We now
describe the commercial context. Our rough view of the technical challenges of commercialization was that, given that the tool would regularly handle large amounts of real code, we needed only a pretty box; the rest was a business issue. This view was naïve. While we include many examples of unexpected obstacles here, they devolve mainly from consequences of two main dynamics:

First, in the research lab a few people check a few code bases; in reality many check many. The problems that show up when thousands of programmers use a tool to check hundreds (or even thousands) of code bases do not show up when you and your co-authors check only a few. The result of summing many independent random variables? A Gaussian distribution, most of it not on the points you saw and adapted to in the lab. Furthermore, Gaussian distributions have tails. As the number of samples grows, so, too, does the absolute number of points several standard deviations from the mean. The unusual starts to occur with increasing frequency.

W. Bradford Paley's CodeProfiles
was originally commissioned for the Whitney Museum of American Art's CODeDOC exhibition and later included in MoMA's Design and the Elastic Mind exhibition. CodeProfiles explores the space of code itself; the program reads its source into memory, traces three points as they once moved through that space, then prints itself on the page.
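The tail effect described above is easy to see numerically. In this hedged sketch (illustrative numbers, standard library only, not from the article), the fraction of draws beyond three standard deviations stays fixed at roughly 0.27%, so the absolute count of "unusual" points grows linearly with sample size:

```python
# Illustrative sketch: for a Gaussian, about 0.27% of samples fall beyond
# 3 standard deviations. The fraction is constant, so the absolute number
# of unusual samples grows linearly with the number of samples taken.
import random

def count_beyond_3_sigma(n_samples, seed=0):
    rng = random.Random(seed)
    return sum(1 for _ in range(n_samples)
               if abs(rng.gauss(0.0, 1.0)) > 3.0)
```

At a thousand samples the expected count is under three; at a million it is roughly 2,700. A weirdness rate of one code base in a thousand is invisible in the lab and a routine event across a large customer population.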
For code, these features include problematic idioms, the types of false positives encountered, the distance of a dialect from a language standard, and the way the build works. For developers, variations appear in raw ability, knowledge, the amount they care about bugs, false positives, and the types of both. A given company won't deviate in all these features but, given the number of features to choose from, often includes at least one weird oddity. Weird is not good. Tools want expected. Expected you can tune a tool to handle; surprise interacts badly with
tuning assumptions.

Second, in the lab the user's values, knowledge, and incentives are those of the tool builder, since the user and the builder are the same person. Deployment leads to severe erosion; users often have little understanding of the tool and little interest in helping develop it (for reasons ranging from simple skepticism to perverse reward incentives) and typically label any error message they find confusing as false. A tool that works well under these constraints looks very different from one tool builders design for themselves.

However, for every user who lacks the understanding or motivation one might hope for, another is eager to understand how it all works (or perhaps already does), willing to help even beyond what one might consider reasonable. Such champions make sales as easily as their antithesis blocks them. However, since their main requirements tend to be technical (the tool must work), the reader likely sees how to make them happy, so we rarely discuss them here.

Most of our lessons come from two different styles of use: the initial trial of the tool and how the company uses the tool after buying it. The trial is a pre-sale demonstration that attempts to show that the tool works well on a potential customer's code. We generally ship a salesperson and an engineer to the customer's site. The engineer configures the tool, runs it over a given code base, and presents results soon after. Initially, the checking run would happen in the morning, and the results meeting would follow in the afternoon; as code size at trials grows, it's not uncommon to split them across two (or more) days.

Sending people to a trial dramatically raises the incremental cost of each sale. However, it gives the non-trivial benefit of letting us educate customers (so they do not label serious, true bugs as false positives) and do real-time, ad hoc workarounds of weird customer system setups.

The trial structure is a harsh test for any tool, and there is little time. The checked system is large (millions of lines of code, with 20-30 MLOC a possibility). The code and its build system are both difficult to understand. However, the tool must routinely go from never seeing the system previously to getting good bugs in a few hours. Since we present results almost immediately after the checking run, the bugs must be good, with few false positives; there is no time to cherry-pick them. Furthermore, the error messages must be clear enough that the sales engineer (who didn't build the checked system or the tool) can diagnose and explain them in real time in response to "What about this one?" questions.

The most common usage model for the product has companies run it as part of their nightly build. Thus, most require that checking runs complete in 12 hours, though those with larger code bases (10+ MLOC) grudgingly accept 24 hours. A tool that cannot analyze at least 1,400 lines of code per minute makes it difficult to meet these targets. During a checking run, error messages are put in a database for subsequent triaging, where users label them as true errors or false positives. We spend significant effort designing the system so these labels are automatically reapplied if the error message they refer to comes up on subsequent runs, despite code-dilating edits or analysis-changing bug-fixes to checkers.

As of this writing (December 2009), approximately 700 customers have licensed the Coverity Static Analysis product, with somewhat more than a billion lines of code among them. We estimate that since its creation the tool has analyzed several billion lines of code, some more difficult than others.

Caveats. Drawing lessons from a single data point has obvious problems. Our product's requirements roughly form a least common denominator set needed by any tool that uses non-trivial analysis to check large amounts of code across many organizations; the tool must find and parse the code, and users must be able to understand the error messages. Further, there are many ways to handle the problems we have encountered, and our way may not be the best one. We discuss our methods more for specificity than as a claim of solution. Finally, while we have had success as a static-tools company, these are small steps. We are tiny compared to mature technology companies. Here, too, we have tried to limit our discussion to conditions likely to be true in a larger setting.

Laws of Bug Finding

The fundamental law of bug finding is: No Check = No Bug. If the tool can't check a system, file, code path, or given property, then it won't find bugs in it. Assuming a reasonable tool, the first-order bound on bug counts is just how much code can be shoved through the tool. Ten times more code is 10 times more bugs. We imagined this law was as simple a statement of fact as we needed. Unfortunately, two seemingly vacuous corollaries place harsh first-order bounds on bug counts.

Law: You can't check code you don't see. It seems too trite to note that checking code requires first finding it... until you try to do so consistently on many large code bases. Probably the most reliable way to check a system is to grab its code during the build process; the build system knows exactly which files are included in the system and how to compile them. This seems like a simple task. Unfortunately, it's often difficult to understand what an ad hoc, homegrown build system is doing well enough to extract this information, a difficulty compounded by the near-universal absolute edict: "No, you can't touch that." By default, companies refuse to let an external force modify anything; you cannot modify their compiler path, their broken makefiles (if they have any), or in any way write or reconfigure anything other than your own temporary files. Which is fine, since if you need to modify it, you most likely won't understand it.

Further, for isolation, companies often insist on setting up a test machine for you to use. As a result, not infrequently the build you are given to check does not work in the first place, which you would get blamed for if you had touched anything.

Our approach in the initial months of commercialization in 2002 was a low-tech, read-only replay of the build commands: run make, record its output in a file, and rewrite the invocations of their compiler (such as gcc) to instead call our checking tool, then rerun everything. Easy and simple. This approach worked perfectly in the lab and for a small number of our earliest customers. We then had the following conversation with a potential customer:

"How do we run your tool?"
"Just type 'make' and we'll rewrite its output."
"What's 'make'? We use ClearCase."
"Uh, what's ClearCase?"

This turned out to be a chasm we
couldn't cross. (Strictly speaking, the customer used ClearMake, but the superficial similarities in name are entirely unhelpful at the technical level.) We skipped that company and went to a few others. They exposed other problems with our method, which we papered over with 90% hacks. None seemed so troublesome as to force us to rethink the approach, at least until we got the following support call from a large customer: "Why is it when I run your tool, I have to reinstall my Linux distribution from CD?"

This was indeed a puzzling question. Some poking around exposed the following chain of events: the company's make used a novel format to print out the absolute path of the directory in which the compiler ran; our script misparsed this path, producing the empty string that we gave as the destination to the Unix cd (change directory) command, causing it to change to the top level of the system; it ran rm -rf * (recursive delete) during compilation to clean up temporary files; and the build process ran as root. Summing these points produces the removal of all files on the
system.

The right approach, which we have used for the past seven years, kicks off the build process and intercepts every system call it invokes. As a result, we can see everything needed for checking, including the exact executables invoked, their command lines, the directory they run in, and the version of the compiler (needed for compiler-bug workarounds). This control makes it easy to grab and precisely check all source code, to the extent of automatically changing the language dialect on a per-file basis. To invoke our tool, users need only call it with their build command as an argument:

  cov-build

We thought this approach was bullet-proof.
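We do not describe cov-build's internals here, but the interposition idea can be sketched. The hypothetical code below (names and trace format are our assumptions, modeled on `strace -f -e trace=execve` output, not Coverity's implementation) filters a process-spawn trace of a build down to the compiler invocations a checking tool would need to replay:

```python
# Hypothetical sketch of build interposition: given one execve record per
# line from a traced build, recover every compiler invocation (executable
# plus argv) so a checking tool can replay it. The trace format is an
# assumption modeled on `strace -f -e trace=execve` output.
import re

COMPILERS = {"gcc", "g++", "cc", "clang"}

# e.g.: 1234 execve("/usr/bin/gcc", ["gcc", "-c", "foo.c"], 0x7ffd...) = 0
EXECVE_RE = re.compile(r'execve\("([^"]+)",\s*\[([^\]]*)\]')

def compiler_invocations(trace_lines):
    """Return (executable, argv) pairs for compilers spawned by the build."""
    calls = []
    for line in trace_lines:
        m = EXECVE_RE.search(line)
        if not m:
            continue
        exe = m.group(1)
        if exe.split("/")[-1] not in COMPILERS:
            continue  # ignore rm, mkdir, shells, ...
        argv = [a.strip().strip('"') for a in m.group(2).split(",") if a.strip()]
        calls.append((exe, argv))
    return calls
```

A real interposer also records the working directory and compiler version, which, as noted above, are needed for per-file dialect selection and compiler-bug workarounds.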
Unfortunately, as the astute reader has noted, it requires a command prompt. Soon after implementing it we went to a large company, so large it had a hyperspecialized build engineer, who engaged in the following dialogue:

"How do I run your tool?"
"Oh, it's easy. Just type cov-build before your build command."
"Build command? I just push this [GUI] button..."

Social vs. technical. The social
restriction that you cannot change anything, no matter how broken it may be, forces ugly workarounds. A representative example: Build interposition on Windows requires running the compiler in the debugger. Unfortunately, doing so causes a very popular Windows C++ compiler (Visual Studio C++ .NET 2003) to prematurely exit with a bizarre error message. After some high-stress fussing, it turns out that the compiler has a use-after-free bug, hit when code used a Microsoft-specific C language extension (certain invocations of its #using directive). The compiler runs fine in normal use; when it reads the freed memory, the original contents are still there, so everything works. However, when run with the debugger, the compiler switches to using a debug malloc, which on each free call sets the freed memory contents to a garbage value. The subsequent read returns this value, and the compiler blows up with a fatal error. The sufficiently perverse reader can no doubt guess the solution.(a)

Law: You can't check code you can't
parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

The C language does not exist; neither does Java, C++, or C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as "C code" no matter how blatantly illegal its contents may be to a language scholar. Fed this illegal not-C code, a tool's C front-end will reject it. This problem is the tool's problem.

(a) Immediately after process startup, our tool writes 0 to the memory location of the in-debugger variable that the compiler checks to decide whether to use the debug malloc.

Compounding it (and others), the person responsible for running the
tool is often not the one punished if the checked code breaks. (This person also often doesn't understand the checked code or how the tool works.) In particular, since our tool often runs as part of the nightly build, the build engineer managing this process is often in charge of ensuring the tool runs correctly. Many build engineers have a single concrete metric of success: that all tools terminate with successful exit codes. They see Coverity's tool as just another speed bump in the list of things they must get through. Guess how receptive they are to fixing code the official compiler accepted but the tool rejected with a parse error? This lack of interest generally extends to any aspect of the tool for which
they are responsible.

Many (all?) compilers diverge from the standard. Compilers have bugs. Or are very old. Written by people who misunderstand the specification (not just for C++). Or have numerous extensions. The mere presence of these divergences causes the code they allow to appear. If a compiler accepts construct X, then given enough programmers and code, eventually X is typed, not rejected, then encased in the code base, where the static tool will, not helpfully, flag it as a parse error.

The tool can't simply ignore divergent code, since significant markets are awash in it. For example, one enormous software company once viewed conformance as a competitive disadvantage, since it would let others make tools usable in lieu of its own. Embedded software companies make great tool customers, given the bug aversion of their customers; users don't like it if their cars (or even their toasters) crash. Unfortunately, the space constraints in such systems and their tight coupling to hardware have led to an astonishing oeuvre of enthusiastically used compiler extensions.

Finally, in safety-critical software systems, changing the compiler often requires costly re-certification. Thus, we routinely see the use of decades-old compilers. While the languages these compilers accept have interesting features, strong concordance with a modern language standard is not one of them. Age begets new problems. Realistically, diagnosing a compiler's divergences requires having a copy of the compiler. How do you purchase a license for a compiler 20 versions old? Or whose company has gone out of business? Not through normal channels. We have literally resorted to buying copies off eBay.

This dynamic shows up in a softer way with non-safety-critical systems; the larger the code base, the more the sales force is rewarded for a sale, skewing sales toward such systems. Large code bases take a while to build and often get tied to the compiler used when they were born, skewing the average age of the compilers whose languages we must accept.

If divergence-induced parse errors are isolated events scattered here and there, then they don't matter. An unsound tool can skip them. Unfortunately, failure often isn't modular. In a sad, too-common story line, some crucial, purportedly C header file contains a blatantly illegal non-C construct. It gets included by all files. The no-longer-potential customer is treated to a constant stream of parse errors as your compiler rips through the customer's source files, rejecting each in turn. The customer's derisive stance is, "Deep source code analysis? Your tool can't even compile code. How can it find bugs?" It may find this event so amusing that it tells many friends.

Tiny set of bad snippets seen in header files. One of the first examples we encountered of illegal-construct-in-key-header-file came up at a large networking company:

  // redefinition of parameter a
  void foo(int a, int a);

The programmer names foo's first formal parameter a and, in a form of lexical locality, the second as well. Harmless. But any conformant compiler will reject this code. Our tool certainly did. This is not helpful; compiling no files means finding no bugs, and people don't need your tool for that. And, because its compiler accepted it, the potential customer blamed us.

Here's an opposite, less-harmless case where the programmer is trying to make two different things the same:

  typedef char int;

(Useless type name in empty declaration.)

And one where readability trumps the language spec:

  unsigned x = 0xdead_beef;

(Invalid suffix '_beef' on integer constant.)

From the embedded space, creating a label that takes no space:

  void x;

(Storage size of 'x' is not known.)

Another embedded example that controls where the space comes from:

  unsigned x @ text;

(Stray '@' in program.)

A more advanced case of a nonstandard construct is:

  Int16 ErrSetJump(ErrJumpBuf buf) = { 0x4E40 + 15, 0xA085; }

It treats the hexadecimal values of machine-code instructions as program source.

The award for most widely used extension should, perhaps, go to Microsoft support for precompiled headers. Among the most nettlesome troubles is that the compiler skips all the text before an inclusion of a precompiled header. The implication of this behavior is that the following code can be compiled without complaint:

  I can put whatever I want here.
  It doesn't have to compile.
  If your compiler gives an error, it sucks.
  #include

Microsoft's on-the-fly header fabrication makes things worse.

Assembly is the most consistently troublesome construct. It's already non-portable, so compilers seem to almost deliberately use weird syntax, making it difficult to handle in a general way. Unfortunately, if a programmer uses assembly, it's probably to write a widely used function, and if the programmer does it, the most likely place to put it is in a widely used header file. Here are two ways (out of many) to issue a mov instruction:

  // First way
  foo() { __asm mov eax, eab mov eax, eab; }

  // Second way
  #pragma asm
  __asm [ mov eax, eab mov eax, eab ]
  #pragma end_asm

The only thing shared in addition to mov is the lack of common textual keys that can be used to elide
them.

We have thus far discussed only C, a simple language; C++ compilers diverge to an even worse degree, and we go to great lengths to support them. On the other hand, C# and Java have been easier, since we analyze the bytecode they compile to rather than their source.

How to parse not-C with a C front-end. OK, so programmers use extensions. How difficult is it to solve this problem? Coverity has a full-time team of some of its sharpest engineers to refight this banal, technically uninteresting problem as their sole job. They're never done.(b) We first tried to make the problem
someone else's problem by using the Edison Design Group (EDG) C/C++ front-end to parse code.(5) EDG has worked on how to parse real C code since 1989 and is the de facto industry-standard front-end. Anyone deciding not to build a homegrown front-end will almost certainly license from EDG. All those who do build a homegrown front-end will almost certainly wish they had licensed EDG after a few experiences with real code. EDG aims not just for mere feature compatibility but for version-specific bug compatibility across a range of compilers. Its front-end probably resides near the limit of what a profitable company can do in terms of front-end gyrations. Unfortunately, the creativity of compiler writers means that despite two decades of work EDG still regularly meets defeat when trying to parse real-world large code bases.(c)

(b) Anecdotally, the dynamic memory-checking tool Purify(10) had an analogous struggle at the machine-code level, where Purify's developers expended significant resources reverse-engineering the various activation-record layouts used by different compilers.

Thus, our next step is:
for each supported compiler, we write a set of transformers that mangle its personal language into something closer to what EDG can parse. The most common transformation simply rips out the offending construct. As one measure of how much C does not exist, the table here counts the lines of transformer code needed to make the languages accepted by 18 widely used compilers look vaguely like C. A line of transformer code was almost always written only when we were burned to a degree that was difficult to work around. Adding each new compiler to our list of supported compilers almost always requires writing some kind of transformer. Unfortunately, we sometimes need a deeper view of semantics, so are forced to hack EDG directly. This method is a last resort. Still, at last count (as of early 2009) there were more than 406(!) places in the front-end where we had an #ifdef COVERITY to handle a specific, unanticipated
construct.

EDG is widely used as a compiler front-end. One might think that for customers using EDG-based compilers we would be in great shape. Unfortunately, this is not necessarily the case. Even ignoring the fact that compilers based on EDG often modify EDG in idiosyncratic ways, there is no single EDG front-end but rather many versions and possible configurations that often accept a slightly different language variant than the (often newer) version we use. As a Sisyphean twist, assume we cannot work around an incompatibility and report it. If EDG then considers the problem important enough to fix, it will roll it together with other patches into a new version. So, to get our own fix, we must upgrade the version we use, often causing divergence from other unupgraded EDG compiler front-ends, and more issues ensue.

(c) Coverity won the dubious honor of being the single largest source of EDG bug reports after only three years of use.

Social versus technical. Can we get customer source code? Almost always, no.
Despite nondisclosure agreements, the answer is still no, even for parse errors and preprocessed code, perhaps because we are viewed as too small to sue to recoup damages. As a result, our sales engineers must type problems in reports from memory. This works as well as you might expect. It's worse for performance problems, which often show up only in large-code settings. But one shouldn't complain, since classified systems make things even worse. Can we send someone on-site to look at the code? No. You listen to recited syntax on the phone.
Bugs

Do bugs matter? Companies buy bug-finding tools because they see bugs as bad. However, not everyone agrees that bugs matter. The following event has occurred during numerous trials. The tool finds a clear, ugly error (memory corruption or use-after-free) in important code, and the interaction with the customer goes like thus:

"So?"
"Isn't that bad? What happens if you hit it?"
"Oh, it'll crash. We'll get a call." [Shrug.]

If developers don't feel pain, they often don't care. Indifference can arise from lack of accountability; if QA cannot reproduce a bug, then there is no blame. Other times, it's just odd:

"Is this a bug? I'm just the security guy."
"That's not a bug; it's in third-party code."
"A leak? Don't know. The author left years ago..."
"No, your tool is broken; that is not
a bug."

Lines of code per transformer for the 18 common compilers we support:

   160  QNX
   280  HP-UX
   285  picc.cpp
   294  sun.java.cpp
   384  st.cpp
   334  cosmic.cpp
   421  intel.cpp
   457  sun.cpp
   603  iccmsa.cpp
   629  bcc.cpp
   673  diab.cpp
   756  xlc.cpp
   912  ARM
   914  GNU
  1294  Microsoft
  1425  keil.cpp
  1848  cw.cpp
  1665  Metrowerks

Given enough code, any bug-finding tool will uncover some weird
examples. Given enough coders, you'll see the same thing. The following utterances were culled from trial meetings.

Upon seeing an error report saying the following loop body was dead code:

  for (i = 1; i < 0; i++)
      ... dead code ...

"No, that's a false positive; a loop executes at least once."

For this memory corruption error (32-bit machine):

  int a[2], b;
  memset(a, 0, 12);

"No, I meant to do that; they are next to each other."

For this use-after-free:

  free(foo);
  foo->bar = ...;

"No, that's OK; there is no malloc call between the free and the use."

As a final example, a buffer overflow checker flagged a bunch of errors of the form:

  unsigned p[4];
  ...
  p[4] = 1;

"No, ANSI lets you write 1 past the end of the array."

After heated argument, the programmer said, "We'll have to agree to disagree." We could agree about the disagreement, though we couldn't quite comprehend it. The (subtle?) interplay between 0-based offsets and buffer sizes seems to come up every few months.
While programmers are not often so egregiously mistaken, the general trend holds; a not-understood bug report is commonly labeled a false positive, rather than spurring the programmer to delve deeper. The result? We have completely abandoned some analyses that might generate difficult-to-understand reports.

How to handle cluelessness. You cannot often argue with people who are sufficiently confused about technical matters; they think you are the one who doesn't get it. They also tend to get emotional. Arguing reliably kills sales. What to do? One trick is to try to organize a large meeting so their peers do the work for you. The more people in the room, the more likely there is someone very smart and respected who cares (about bugs and about the given code), can diagnose an error (to counter arguments that it's a false positive), has been burned by a similar error, loses his/her bonus for errors, or is in another group (another potential sale).

Further, a larger results meeting
increases the probability that anyone laid off at a later date attended it and saw how your tool worked. True story: A networking company agreed to buy the Coverity product, and one week later laid off 110 people (not because of us). Good or bad? For the fired people it clearly wasn't a happy day. However, it had a surprising result for us at a business level; when these people were hired at other companies, some suggested bringing the tool in for a trial, resulting in four sales.

What happens when you can't fix
all the bugs? If you think bugs are bad enough to buy a bug-finding tool, you will fix them. Not quite. A rough heuristic: with fewer than 1,000 bugs, companies fix them. More? The baseline is to record the current bugs and not fix them, but do fix any new bugs. Many companies have independently come up with this practice, which is more rational than it seems. Having a lot of bugs usually requires a lot of code. Much of it won't have changed in a long time. A reasonable, conservative heuristic is that if you haven't touched code in years, don't modify it (even for a bug fix) to avoid causing any breakage.

A surprising consequence is that it's not uncommon for tool improvement to be viewed as bad or at least a problem. Pretend you are a manager. For anything bad you can measure, you want it to diminish over time. This means you are improving something and get a bonus.
You may not understand technical issues that well, and your boss certainly doesn't understand them. Thus, you want a simple graph that looks like Figure 1; no manager gets a bonus for Figure 2. Representative story: At company X, version 2.4 of the tool found approximately 2,400 errors, and over time the company fixed about 1,200 of them. Then it upgraded to version 3.6. Suddenly there were 3,600 errors. The manager was furious for two reasons: One, we undid all the work his people had done, and two, how could we have missed them the first time?
How do upgrades happen when more bugs is no good? Companies independently settle on a small number of upgrade models:

- Never. Guarantees improvement;
- Never before a release (where it would be most crucial). Counterintuitively, this happens most often in companies that believe the tool helps with release quality, in that they use it to gate the release;
- Never before a meeting. This is at least socially rational;
- Upgrade, then roll back. Seems to happen at least once at large companies; and
- Upgrade only checkers where they fix most errors. Common checkers include use-after-free, memory corruption, (sometimes) locking, and (sometimes) checkers that flag code contradictions.

Do missed errors matter? If people
don't fix all the bugs, do missed errors (false negatives) matter? Of course not; they are invisible. Well, not always. Common cases: Potential customers intentionally introduced bugs into the system, asking "Why didn't you find it?" Many check if you find important past bugs. The easiest sale is to a group whose code you are checking that was horribly burned by a specific bug last week, and you find it. If you don't find it? No matter the hundreds of other bugs that may be the next important bug. Here is an open secret known to bug finders: The set of bugs found by tool A is rarely a superset of another tool B's, even if A is much better than B. Thus, the discussion gets pushed from "A is better than B" to "A finds some things, B finds some things," and that does not help the case of A.
Adding bugs can be a problem; losing already-inspected bugs is always a problem, even if you replace them with many more new errors. While users know in theory that the tool is not a verifier, it's very different when the tool demonstrates this limitation, good and hard, by losing a few hundred known errors after an upgrade. The easiest way to lose bugs is to add just one to your tool. A bug that causes false negatives is easy to miss. One such bug in how our early research tool's internal representation handled array references meant the analysis ignored most array uses for more than nine months. In our commercial product, blatant situations like this are prevented through detailed unit testing, but uncovering the effect of subtle bugs is still difficult
because customer source code is complex and not available.

Churn
Users really want the same result from run to run. Even if they changed their code base. Even if they upgraded the tool. Their model of error messages? Compiler warnings. Classic determinism states: same input + same function = same result. What users want: different input (modified code base) + different function (tool version) = same result. As a result, we find upgrades to be a constant headache. Analysis changes can easily cause the set of defects found to shift. The new-speak term we use internally is "churn." A big change from academia is that we spend considerable time and energy worrying about churn when modifying checkers. We try to cap churn at less than 5% per release. This goal means large classes of analysis tricks are disallowed since they cannot obviously guarantee minimal effect on the bugs found.
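A cap like 5% implies churn must be measured run over run. As an illustrative sketch only (our invention, not Coverity's actual scheme; the checker names and the keying are assumptions), churn between two runs can be computed by keying each report on (checker, file, line) and counting reports lost and gained:

```c
#include <string.h>

/* A defect report keyed the way a tool might match results across runs. */
struct report { const char *checker; const char *file; int line; };

static int same(const struct report *a, const struct report *b) {
    return a->line == b->line &&
           strcmp(a->checker, b->checker) == 0 &&
           strcmp(a->file, b->file) == 0;
}

static int contains(const struct report *set, int n, const struct report *r) {
    for (int i = 0; i < n; i++)
        if (same(&set[i], r)) return 1;
    return 0;
}

/* churn% = (reports lost + reports gained) / old report count * 100 */
double churn_percent(const struct report *oldr, int n_old,
                     const struct report *newr, int n_new) {
    int lost = 0, gained = 0;
    for (int i = 0; i < n_old; i++)
        if (!contains(newr, n_new, &oldr[i])) lost++;
    for (int i = 0; i < n_new; i++)
        if (!contains(oldr, n_old, &newr[i])) gained++;
    return 100.0 * (lost + gained) / n_old;
}

/* Two sample runs: one old report disappears, one new one appears. */
const struct report run1[] = {
    {"USE_AFTER_FREE", "a.c", 10}, {"LOCK", "a.c", 40},
    {"NULL_DEREF", "b.c", 7},      {"LOCK", "c.c", 99},
};
const struct report run2[] = {
    {"USE_AFTER_FREE", "a.c", 10}, {"LOCK", "a.c", 40},
    {"NULL_DEREF", "b.c", 7},      {"NULL_DEREF", "d.c", 3},
};
```

With these samples, churn_percent(run1, 4, run2, 4) is 50.0: one report lost plus one gained out of four old reports, which would blow far past a 5% budget.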
Randomization is verboten, a tragedy given that it provides simple, elegant solutions to many of the exponential problems we encounter. Timeouts are also bad and sometimes used as a last resort but never encouraged.

Myth: More analysis is always good. While nondeterministic analysis might cause problems, it seems that adding more deterministic analysis is always good. Bring on path sensitivity! Theorem proving! SAT solvers! Unfortunately, no. At the most basic level, errors found with little analysis are often better than errors found with deeper tricks. A good error is probable, a true error, easy to diagnose; best is difficult to misdiagnose. As the number of analysis steps increases, so, too, does the chance of analysis mistake, user confusion, or the perceived improbability of event sequence. No analysis equals no mistake.
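The difference shows up even in tiny examples. In this illustrative C sketch (our code, not the article's), the first defect needs almost no analysis to confirm, while the second needs interprocedural, path-sensitive reasoning, exactly the extra steps that invite mistakes and user doubt:

```c
#include <stddef.h>

/* Shallow: the dereference directly contradicts the preceding NULL
   check; a user can confirm the report at a glance. */
int shallow(int *p) {
    if (p == NULL)
        return *p;        /* flagged: dereference on the p == NULL path */
    return *p + 1;
}

/* Helper: returns NULL exactly when n is negative. */
static int *pick(int *a, int *b, int n) {
    return n < 0 ? NULL : (n % 2 ? a : b);
}

/* Deep: the bad dereference is real, but seeing it requires tracking
   pick()'s return value across the call and correlating it with the
   n < 0 path, so the report is easier to misdiagnose and dismiss. */
int deep(int *a, int *b, int n) {
    int *q = pick(a, b, n);
    if (n < 0)
        return *q;        /* flagged only with interprocedural path tracking */
    return *q + 1;
}
```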
Further, explaining errors is often more difficult than finding them. A misunderstood explanation means the error is ignored or, worse, transmuted into a false positive. The heuristic we follow: Whenever a checker calls a complicated analysis subroutine, we have to explain what that routine did to the user, and the user will then have to (correctly) manually replicate that tricky thing in his/her head.

Sophisticated analysis is not easy to explain or redo manually. Compounding the problem, users often lack a strong grasp on how compilers work. A representative user quote is "Static analysis? What's the performance overhead?"
The end result? Since the analysis that suppresses false positives is invisible (it removes error messages rather than generates them), its sophistication has scaled far beyond what our research system did.

[Figure 1. Bugs down over time = manager bonus.]
[Figure 2. No bonus.]

On the other hand, the commercial Coverity product, despite its improvements, lags behind the research system in some ways because it had to drop checkers or techniques that demand too much sophistication on the part of the user. As an example, for many years we gave up on checkers that flagged concurrency errors; while finding such errors was not too difficult, explaining them to many users was. (The PREfix system also avoided reporting races for similar reasons though is now supported by
Coverity.)

No bug is too foolish to check for. Given enough code, developers will write almost anything you can think of. Further, completely foolish errors can be some of the most serious; it's difficult to be extravagantly nonsensical in a harmless way. We've found many errors over the years. One of the absolute best was the following in the X Window System:

if (getuid() != 0 && geteuid == 0) {
    ErrorF("only root");
    exit(1);
}

It allowed any local user to get root access^d and generated enormous press coverage, including a mention on Fox News (the Web site). The checker was written by Scott McPeak as a quick hack to get himself familiar with the system. It made it into the product not because of a perceived need but because there was no reason not to put it in. Fortunately.

False Positives
False
positives do matter. In our experience, more than 30% easily cause problems. People ignore the tool. True bugs get lost in the false. A vicious cycle starts where low trust causes complex bugs to be labeled false positives, leading to yet lower trust. We have seen this cycle triggered even for true errors. If people don't understand an error, they label it false. And done once, induction makes the (n+1)th time easier. We initially thought false positives could be eliminated through technology. Because of this dynamic we no longer think so.

We've spent considerable technical effort to achieve low false-positive rates in our static analysis product. We aim for below 20% for "stable" checkers. When forced to choose between more bugs or fewer false positives we typically choose the latter.

Talking about false positive rate is simplistic since false positives are not all equal. The initial reports matter inordinately; if the first N reports are false positives (N = 3?), people tend to utter variants on "This tool sucks." Furthermore, you never want an embarrassing false positive. A stupid false positive implies the tool is stupid. ("It's not even smart enough to figure that out?") This technical mistake can cause social problems. An expensive tool needs someone with power within a company or organization to champion it. Such people often have at least one enemy. You don't want to provide ammunition that would embarrass the tool champion internally; a false positive that fits in a punchline is really bad.

d The tautological check geteuid == 0 was intended to be geteuid() == 0. In its current form, it compares the address of geteuid to 0; given that the function exists, its address is never 0.

Conclusion
While we've focused on some of the less-pleasant experiences in the commercialization of bug-finding products, two positive experiences trump them all. First, selling a static tool has become dramatically easier in recent years. There has been a seismic shift in terms of the average programmer "getting it." When you say you have a static bug-finding tool, the response is no longer "Huh?" or "Lint? Yuck." This shift seems due to static bug finders being in wider use, giving rise to nice networking effects. The person you talk to likely knows someone using such a tool, has a competitor that uses it, or has been in a company that used it.

Moreover, while seemingly vacuous tautologies have had a negative effect on technical development, a nice balancing empirical tautology holds that bug finding is worthwhile for anyone with an effective tool. If you can find code, and the checked system is big enough, and you can compile (enough of) it, then you will always find serious errors. This appears to be a law. We encourage readers to exploit it.

Acknowledgments
We thank Paul Twohey, Cristian Cadar, and especially Philip Guo for their helpful, last-minute proofreading. The experience covered here was the work of many. We thank all who helped build the tool and company to its current state, especially the sales engineers, support engineers, and services engineers who took the product into complex environments and were often the first to bear the brunt of problems. Without them there would be no company to document. We especially thank all the customers who tolerated the tool during its transition from research quality to production quality and the numerous champions whose insightful feedback helped us focus on what mattered.

References
1. Ball, T. and Rajamani, S.K. Automatically validating temporal safety properties of interfaces. In Proceedings of the Eighth International SPIN Workshop on Model Checking of Software (Toronto, Ontario, Canada). M. Dwyer, Ed. Springer-Verlag, New York, 2001, 103–122.
2. Bush, W., Pincus, J., and Sielaff, D. A static analyzer for finding dynamic programming errors. Software: Practice and Experience 30, 7 (June 2000), 775–802.
3. Coverity static analysis; http://www.coverity.com
4. Das, M., Lerner, S., and Seigle, M. ESP: Path-sensitive program verification in polynomial time. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 57–68.
5. Edison Design Group. EDG C compiler front-end; http://www.edg.com
6. Engler, D., Chelf, B., Chou, A., and Hallem, S. Checking system rules using system-specific, programmer-written compiler extensions. In Proceedings of the Fourth Conference on Operating System Design & Implementation (San Diego, Oct. 22–25). USENIX Association, Berkeley, CA, 2000, 1–1.
7. Flanagan, C., Leino, K.M., Lillibridge, M., Nelson, G., Saxe, J.B., and Stata, R. Extended static checking for Java. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 234–245.
8. Foster, J.S., Terauchi, T., and Aiken, A. Flow-sensitive type qualifiers. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 1–12.
9. Hallem, S., Chelf, B., Xie, Y., and Engler, D. A system and language for building system-specific, static analyses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 69–82.
10. Hastings, R. and Joyce, B. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter 1992 USENIX Conference (Berkeley, CA, Jan. 20–24). USENIX Association, Berkeley, CA, 1992, 125–138.
11. Xie, Y. and Aiken, A. Context- and path-sensitive memory leak detection. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (Lisbon, Portugal, Sept. 5–9). ACM Press, New York, 2005, 115–125.

Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, and Scott McPeak are current or former employees of Coverity, Inc., a software company based in San Francisco, CA; http://www.coverity.com

Dawson Engler ([email protected]) is an associate professor in the Department of Computer Science and Electrical Engineering at Stanford University, Stanford, CA, and technical advisor to Coverity, Inc., San Francisco, CA.

© 2010 ACM 0001-0782/10/0200 $10.00