Reproducible Computational Workflows with Continuous Analysis

Brett K. Beaulieu-Jones 1, and Casey S. Greene 2,+

1 Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania. 2 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania. + Corresponding author. Email: [email protected]. Address: 3400 Civic Center Blvd., 10-131 SCTR, Philadelphia, PA 19103.

bioRxiv preprint doi: https://doi.org/10.1101/056473; this version posted August 11, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.


Abstract

Reproducing experiments is vital to science. Being able to replicate, validate and extend previous work also speeds new research projects. Reproducing computational biology experiments, which are scripted, should be straightforward. But reproducing such work remains challenging and time consuming. In the ideal world we would be able to quickly and easily rewind to the precise computing environment where results were generated. We would then be able to reproduce the original analysis or perform new analyses. We introduce a process termed “continuous analysis” which provides inherent reproducibility to computational research at a minimal cost to the researcher. Continuous analysis combines Docker, a container service similar to virtual machines, with continuous integration, a popular software development technique, to automatically re-run computational analysis whenever relevant changes are made to the source code. This allows results to be reproduced quickly, accurately and without needing to contact the original authors. Continuous analysis also provides an audit trail for analyses that use data with sharing restrictions. This allows reviewers, editors, and readers to verify reproducibility without manually downloading and rerunning any code. Example configurations are available at our online repository (https://github.com/greenelab/continuous_analysis).

The Current State of Reproducibility

Leading scientific journals have targeted reproducibility to increase readers’ confidence in results and reduce retractions [1–5]. In a recent survey, 90% of researchers acknowledged a reproducibility crisis [6]. Research that uses computational protocols should be particularly amenable to reproducible workflows because all of the steps are scripted into a machine-readable format. But written descriptions of computational approaches can be difficult to understand and may lack required parameters. Even when results can be reproduced, the process often requires a substantial time investment and help from the original authors. Garijo et al. [7] estimated it would take 280 hours for a non-expert to reproduce a paper describing a computational construction of a drug-target network for Mycobacterium tuberculosis [8]. These are the good scenarios: the results behind most computational papers are not readily reproducible [7,9–11].

The practice of “open science” has been discussed as a means to aid reproducibility [3,12]. In open science the data and source code are shared. Sharing can also extend to intermediate results and project planning [13]. Sharing data and source code is currently necessary but not sufficient to make research reproducible. Even when code and data are shared, it remains difficult to reproduce results due to differing computing environments, operating systems, library dependencies, etc. It is common to use one or more open source libraries on a project, and research code quickly becomes dependent on old versions of these libraries as software advances [14]. These old or broken dependencies make it difficult for readers and reviewers to recreate the environment of the original researchers, whether to validate or extend their work.

An example of where sharing data does not automatically make science reproducible occurs in the most standard of places: differential gene expression analysis. Such analyses are routine. Our understanding of the genome, including transcriptome annotations, has improved, and updated probe set definitions are available [15]. Analyses relying on unspecified probe set definitions cannot be reproduced using current definitions.

We analyzed the fifteen most recently published papers that cite Dai et al., a common source for custom chip description files (Custom CDF), that were accessible at our institution [16–31]. We identified these manuscripts using Web of Science on May 31, 2016. We recorded the number of papers that cited a version of Custom CDF, as well as which version was cited. We expect this analysis to provide an upper bound on reproducible work: these papers specifically cited the source of their CDFs. Of these fifteen papers, nine (60%) specified which version they used. These nine used versions 11, 15, 16, 17, 18, and 19 of the BrainArray Custom CDF.

This initial analysis was performed based on article recency without regard to article impact. To determine the extent to which this issue affects high impact papers, we performed a parallel evaluation for the ten most cited papers [32–41] that cite Dai et al. We determined the ten most cited papers using Web of Science on May 31, 2016. Of these ten papers, one [38] (10%) specified which version of the Custom CDF was used. That paper used version 11 of the BrainArray Custom CDF.

We sought to determine which versions were currently in use in the field. We asked three individuals who performed microarray analysis recently, and we accessed and evaluated two cluster systems used for processing data. We found that each individual had one of the three most recently released versions installed (18, 19, and 20), and versions 18 and 19 were installed on cluster systems.

To evaluate the impact of differing CDF versions, we downloaded a recently published public gene expression dataset (GEO Series accession number GSE47664). This experiment examined differential expression between normal HeLa cells and HeLa cells with TIA1 and TIAR knocked down [42]. We performed a parallel analysis using each of the three versions that we found installed on machines that we could access (18, 19, and 20). Each version identifies a different number of significantly altered genes (Figure 1A), demonstrating the challenge of reproducible analysis. We simulated a parallel analysis of differential expression using Docker containers on mismatched machines [43]. This specifies the CDF version and produces the same number and set of differentially expressed genes for a given version across machines (v18 example in Figure 1B). Had continuous analysis been used for papers citing the BrainArray Custom CDF, their computational results would be easily replicated.
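The version comparison described above amounts to comparing sets of significant genes. The following is a small illustrative sketch of that comparison, not the analysis code used for Figure 1. The function names are hypothetical, and the gene identifiers are made-up placeholders rather than results from GSE47664.

```python
"""Compare sets of significant genes obtained under different CDF versions."""
from itertools import combinations
from typing import Dict, Set, Tuple


def count_per_version(results: Dict[str, Set[str]]) -> Dict[str, int]:
    # Number of genes called significant under each CDF version.
    return {version: len(genes) for version, genes in results.items()}


def pairwise_disagreements(
    results: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    # For each pair of versions, the genes called significant under
    # exactly one of the two (the symmetric difference of the sets).
    return {
        (a, b): results[a] ^ results[b]
        for a, b in combinations(sorted(results), 2)
    }


if __name__ == "__main__":
    # Placeholder data for illustration only; these are NOT real results.
    placeholder_results = {
        "v18": {"GENE_A", "GENE_B"},
        "v19": {"GENE_A", "GENE_C", "GENE_D"},
        "v20": {"GENE_A"},
    }
    print(count_per_version(placeholder_results))
    for pair, genes in pairwise_disagreements(placeholder_results).items():
        print(pair, sorted(genes))
```

An empty disagreement set for every pair of runs is what a pinned, containerized environment is expected to give when the same version is used on different machines.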


Figure 1. Current state of research computing vs. container-based approaches. A.) The status quo requires a reader or reviewer to find and install specific versions of dependencies. These dependencies can become difficult to find and may become incompatible with newer versions of other software packages. Different versions of packages identify different numbers of significantly differentially expressed genes from the same source code and data. B.) Containers define a computing environment that captures dependencies. In container-based systems, the results are the same regardless of the host system.

Continuous Analysis in Computational Workflows

We developed continuous analysis to produce a verifiable end-to-end run of computational research with minimal start-up costs. In contrast with the status quo, continuous analysis preserves the computing environment and maintains the versions of dependencies. We described the benefits of containerized approaches above, but maintaining, running and distributing Docker images manually would become time consuming. Integrating Docker into a continuous scientific analysis pipeline meets three criteria: (1) anyone can re-run code in a computing environment matching the original authors (Supplemental Figure 1); (2) readers and reviewers can follow exactly what was done in an “audit” fashion without running code (Supplemental Figure 2 & 3); and (3) the solution imposes zero to minimal cost in terms of time and money on the researcher, depending on their current research process.

Continuous analysis extends continuous integration [44], a common practice in software development and deployment. Continuous integration is a software development workflow that triggers an automated build process whenever developers check their code into a source control repository. This automated build process runs test scripts if they exist. These tests can catch bugs introduced into software. Software that passes tests is automatically deployed to remote servers.

For continuous analysis (Figure 2), we repurpose these services in order to run computational analyses, update figures, and publish changes to online repositories whenever relevant changes are made to the source code. When an author is ready to release code or publish their work they can export the most recent continuous integration run. Because this process generates results in a clean and clearly defined computing environment without manual intervention, reviewers can be confident that the analyses are reproducible.

Figure 2. Continuous analysis can be set up in three primary steps (numbered 1, 2, and 3). (1) The researcher creates a Docker container with the required software. (2) The researcher configures a continuous integration service to use this Docker image. (3) The researcher pushes code that includes a script capable of running the analyses from start to finish. The continuous integration provider runs the latest version of code in the specified Docker environment without manual intervention. This generates a Docker container with intermediate results that allows anyone to rerun analysis in the same environment, produces updated figures, and stores logs describing everything that occurred. Example configurations are available in the supplementary materials as well as our online repository (https://github.com/greenelab/continuous_analysis). Because code is run in an independent, reproducible computing environment and produces detailed logs of what was executed, this practice reduces or eliminates the need for reviewers to re-run code to verify reproducibility.


In each project we maintain dependencies with the free open-source software tool Docker [45]. Docker defines an “image” that allows users to download and run a container, a minimalist virtual machine with a predefined computing environment. Docker images can be several gigabytes in size, but once downloaded can be started in a matter of seconds and have minimal overhead [14]. In addition, Docker images can be easily tagged to coincide with software releases and paper revisions. At the time of submission, authors can run the `docker save` command to export a static file that can be uploaded to services such as Figshare or Zenodo to receive a DOI. For example, we have uploaded our continuous analysis environment for the examples in this paper [46].
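The tagging and export step described above can be scripted. The following is a minimal sketch and not the authors' tooling: it only builds the argument lists for `docker tag` and `docker save -o` without running them. The repository name, tags, and archive path are hypothetical placeholders.

```python
"""Build (but do not run) Docker commands for tagging and exporting an image."""
import shlex
from typing import List


def tag_command(repository: str, current_tag: str, release_tag: str) -> List[str]:
    # `docker tag SOURCE TARGET` gives the existing image an additional name,
    # here one that matches a software release or paper revision.
    return [
        "docker", "tag",
        f"{repository}:{current_tag}",
        f"{repository}:{release_tag}",
    ]


def save_command(repository: str, release_tag: str, archive_path: str) -> List[str]:
    # `docker save -o FILE IMAGE` writes the image to a tar archive that can
    # then be uploaded to an archival service.
    return ["docker", "save", "-o", archive_path, f"{repository}:{release_tag}"]


if __name__ == "__main__":
    # Hypothetical names, chosen only for illustration.
    commands = [
        tag_command("example/analysis", "latest", "submission-v1"),
        save_command("example/analysis", "submission-v1", "analysis_env.tar"),
    ]
    for command in commands:
        print(shlex.join(command))
```

Keeping the commands as argument lists makes them easy to review before passing them to `subprocess.run` on a machine where Docker is installed.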

To set up continuous analysis, a researcher needs to do three things. First, they must create a Dockerfile, which specifies a list of dependencies. Second, they need to connect a continuous integration service to their version control system and provide the commands to run their analysis. Finally, they need to commit and push changes to their version control system. Many researchers already perform the first and third tasks in their standard workflow.

The continuous integration system will automatically rerun the specified analysis with each change, precisely matching the source code and results. It can also be set to listen and run only when changes are marked in a specific way, e.g. by committing to a specific ‘staging’ branch. For the first project, this process can be put into place in less than a day. For subsequent projects, this can be done in under an hour.

Setting up Continuous Analysis

We have created a GitHub repository with instructions for paid, local, and cloud-based continuous analysis setups [47]. These are fully detailed in the supplementary materials and online repository. Here we describe how continuous analysis can be set up using the free and open source Drone software on a researcher’s personal computer and connected to the GitHub version control service. This setup is free to users.

1. Install Docker on the computer.

2. Pull the Drone image via docker:

   sudo docker pull drone/drone:latest


3. Create a new application in GitHub (Figure 3).

Figure 3. Register a new application for the Drone continuous integration server. Set the homepage URL to be the IP address of the Drone computer. Set the callback URL to the same IP address followed by /authorize.


4. Add a webhook to the GitHub project (Figure 4). This will notify the continuous integration server of any updates pushed to the repository.

Figure 4. Register a new application for the Drone continuous integration server. The payload URL should be in the format of your-ip/api/hook/github.com/client-id

5. Create a configuration file on the Drone computer at /etc/drone/dronerc, filling in the client information provided by GitHub:

   REMOTE_DRIVER=github
   REMOTE_CONFIG=https://github.com?client_id=....&client_secret=....

6. Run the drone container:

   sudo docker run drone/drone:latest
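The values entered in steps 3 to 5 all derive from the address of the Drone computer and the client information issued by GitHub. As an illustration only, the sketch below assembles them from the formats quoted in those steps. The function names are hypothetical, and the address, client id, and client secret are placeholders that must be replaced with real values.

```python
"""Assemble the settings from steps 3 to 5 from a Drone address and GitHub client details."""


def callback_url(drone_address: str) -> str:
    # Step 3: the callback URL is the Drone address followed by /authorize.
    return f"{drone_address.rstrip('/')}/authorize"


def webhook_payload_url(drone_address: str, client_id: str) -> str:
    # Step 4: payload URL in the format your-ip/api/hook/github.com/client-id.
    return f"{drone_address.rstrip('/')}/api/hook/github.com/{client_id}"


def render_dronerc(client_id: str, client_secret: str) -> str:
    # Step 5: the two lines written to /etc/drone/dronerc.
    remote_config = (
        f"https://github.com?client_id={client_id}&client_secret={client_secret}"
    )
    return f"REMOTE_DRIVER=github\nREMOTE_CONFIG={remote_config}\n"


if __name__ == "__main__":
    # Placeholder values for illustration; 203.0.113.10 is a documentation address.
    address = "http://203.0.113.10"
    print(callback_url(address))
    print(webhook_payload_url(address, "EXAMPLE_CLIENT_ID"))
    print(render_dronerc("EXAMPLE_CLIENT_ID", "EXAMPLE_CLIENT_SECRET"), end="")
```

Generating the file contents rather than typing them reduces the chance of a mismatch between the GitHub application settings and the Drone configuration. The client secret should not be committed to a public repository.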

Continuous analysis can be performed with dozens of full service providers or a private installation on a local machine, cluster or cloud service [47]. Full service providers can be set up in minutes but may have computational resource limits or monthly fees. Private installations require configuration but can scale to a local cluster or cloud service to match the computational complexity of all walks of research. With free, open-source continuous integration software [48], computing resources are the only associated costs.

Using Continuous Analysis

After setup, running continuous analysis is simple and fits into existing research workflows that use source control systems. We have used continuous analysis in our own work [49]. We have also prepared three example repositories (detailed in supplemental materials):

1. An example demonstrating the setup of continuous analysis with a wide variety of services and configurations (highlighted below).

2. An easy-to-follow basic phylogeny tree building example, combining sequence alignment using MAFFT [50], format conversion using EMBOSS Seqret [51], and tree calculation and drawing.

3. An RNA expression analysis workflow examining organoid models of pancreatic cancer in mice based on work from Boj et al. [52] using details and source code published by Balli [53]. This example shows the ability of continuous analysis to scale to large computations. This example uses kallisto [54], limma [55,56], and sleuth [57] to analyze 150 GB of gene expression data and approximately 480 million reads.

To demonstrate the setup process and different configurations of continuous analysis we show a simple example of continuous analysis with kallisto. The recently published software tool kallisto quantifies transcript abundance in RNA-seq data. Our example re-runs the examples provided in kallisto with each commit to a repository.

1. Add a script file to re-run custom analysis. For Drone, this is a .drone.yml file that specifies commands to run each step of the analysis. An example configuration is available in the continuous analysis GitHub repository as well as the supplemental materials.

2. Commit changes to the source control repository.

3. Push changes to GitHub.
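The commands listed in the script file from step 1 typically call a script that runs the analysis from start to finish. The sketch below shows one possible shape for such a script. It is an assumption for illustration, not the example configuration from the repository and not the `.drone.yml` format itself. The `run_steps` helper and the placeholder steps are hypothetical, and a real project would substitute its own analysis commands.

```python
"""Run analysis steps in order and keep a record of what was executed."""
import subprocess
import sys
from typing import Dict, List, Sequence


def run_steps(steps: Sequence[Sequence[str]]) -> List[Dict[str, object]]:
    log: List[Dict[str, object]] = []
    for argv in steps:
        # Run one step and capture its output so it can be kept as a log.
        result = subprocess.run(list(argv), capture_output=True, text=True)
        log.append(
            {
                "command": list(argv),
                "returncode": result.returncode,
                "stdout": result.stdout,
                "stderr": result.stderr,
            }
        )
        if result.returncode != 0:
            # Stop at the first failing step so later steps do not run
            # on incomplete intermediate results.
            break
    return log


if __name__ == "__main__":
    # Placeholder steps; these only print a message.
    example_steps = [
        [sys.executable, "-c", "print('quantify transcripts')"],
        [sys.executable, "-c", "print('regenerate figures')"],
    ]
    entries = run_steps(example_steps)
    for entry in entries:
        print(entry["returncode"], " ".join(entry["command"]))
        print(entry["stdout"], end="")
    # A non-zero exit code marks the run as failed.
    sys.exit(0 if all(e["returncode"] == 0 for e in entries) else 1)
```

Exiting with a non-zero status on failure is the usual way for a script to signal a failed run to whatever service invoked it.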


Figure 5. Audit logs from a continuous integration run with the service Shippable for the kallisto example.

The configured continuous integration service automatically runs the specified script. We configured this to rerun the analysis, regenerate the figures, and commit updated versions to the repository. The service provides a complete audit log of what was run in the clean continuous integration environment (Figure 5). By generating and pushing updated figures, this process also generates a complete changelog for each result (Figure 6). Interactive development tools, such as Jupyter [58,59], RMarkdown [60,61] and Sweave [62] can be incorporated to present the code and analysis in a logical graphical manner. For example, we recently used Jupyter with continuous analysis in our own publication [63] and corresponding repository [49].


Figure 6. Resulting figures from the run are committed back to GitHub where changes between runs can be viewed. A.) The effect of adding an additional gene (HumanTw2) to a phylogenetic tree-building example. B.) The effect of adding an additional gene (mt8) to an RNA-seq differential expression experiment PCA plot.

In summary, continuous analysis provides the results of a verifiable end-to-end run in a “clean” environment. Because continuous analysis runs automatically in the background, no transition is needed between the exploration and publication phases of a scientific project. The audit trail provided by continuous analysis allows reviewers and editors to provide sound judgment on reproducibility without a large time commitment. If readers or reviewers would like to re-run the code on their own (e.g. to change a parameter and evaluate the impact on results), they can easily do so with the Docker container containing the final computing environment and intermediate results. Version control systems provide the capability to watch for updates. Readers can “star” or “watch” a repository on services such as GitHub, GitLab, and Bitbucket to be automatically notified of changes and updated runs. Wide adoption of these systems throughout the publication process could allow reviewers and editors to automatically be notified of updated results.

Continuous analysis provides an audit trail for reproducible analyses of closed data

Continuous analysis can be even more powerful when working with closed data that cannot be released. Without continuous analysis, reproducing computational analyses based on closed data is dependent on the original authors completely and exactly describing each step, a process that may be an afterthought and relegated to extended methods. Readers must then diligently follow complex written instructions without intermediate confirmation they are on the right track. The containers produced during continuous analysis include a matching environment for replication as well as intermediate results. This allows readers to determine where their results diverge from the original work and to determine whether divergence is due to software-based or data-based differences.

Best practices with continuous analysis

We suggest a development workflow where continuous analysis runs only on a single branch (Supplemental Figure 4). Researchers can push to this branch when they believe they are ready for a release to avoid running the full process during incomplete updates. If the updates to this branch succeed, the changes are then automatically carried over to the master or production branch and released. We recommend exporting both the before and after processing Docker images and uploading to an archival service like Figshare or Zenodo. The archived images can then be cited to guide readers to the version used in the manuscript [46]. For convenience, the images can also be shared through the Docker Hub registry.

It may currently be impractical to use continuous analysis for generic preprocessing steps involving very large data or analyses requiring particularly high computational costs. In particular, steps that take days to run or incur substantial costs in computational resources may not be amenable with existing providers [64]. One day, continuous analysis systems specifically designed for scientific workflows may facilitate reproducible workflows in these settings. For now, researchers may need to use discretion when preprocessing via continuous analysis, as it may be computationally intractable to reanalyze after each commit to a staging branch. Researchers may elect to run only the final workflow through this process, or may elect to employ continuous analysis after standard but computationally expensive preprocessing steps are completed.
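The single-branch workflow suggested above comes down to a small decision: run the full analysis only when a push targets the designated branch. The following sketch illustrates that rule under stated assumptions. It assumes the push notification carries a Git ref such as refs/heads/staging, and the branch name and function names are hypothetical. Continuous integration services generally offer their own branch filters, which should be preferred where available.

```python
"""Decide whether a pushed change should trigger the full analysis."""

# Hypothetical branch name; the text only gives a 'staging' branch as an example.
STAGING_BRANCH = "staging"

_BRANCH_PREFIX = "refs/heads/"


def branch_from_ref(ref: str) -> str:
    # Branch pushes use refs of the form "refs/heads/<branch>".
    # Anything else (for example a tag) yields an empty string.
    if ref.startswith(_BRANCH_PREFIX):
        return ref[len(_BRANCH_PREFIX):]
    return ""


def should_run_analysis(ref: str, analysis_branch: str = STAGING_BRANCH) -> bool:
    # Run the expensive workflow only for pushes to the designated branch.
    branch = branch_from_ref(ref)
    return branch != "" and branch == analysis_branch


if __name__ == "__main__":
    for example_ref in ("refs/heads/staging", "refs/heads/master", "refs/tags/v1.0"):
        print(example_ref, should_run_analysis(example_ref))
```

Restricting runs this way keeps incomplete work on other branches from consuming computing resources, which matters most for the expensive workflows discussed above.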
For small datasets and less intensive computational workflows it is easiest to use a full service continuous integration service. These services have the smallest setup times. With private data or when data size and computational complexity scale it becomes necessary to set up a local privately hosted continuous integration server. Cluster or cloud based continuous integration servers can handle the largest workflows.

The impact of reproducible computational research

Reproducibility can have wide-reaching benefits for the advancement of science. For authors, easily reproducible work is a sign of quality and credibility. Continuous analysis addresses the reproducibility of computational analyses in the narrow sense: generating the same results from the same inputs. It does not solve reproducibility in the broader sense: how robust results are to parameter settings, starting conditions and partitions in the data. Continuous analysis lays the groundwork needed to address reproducibility and robustness of findings in the broad sense.

Acknowledgements

This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to CSG (GBMF4552) and supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health. We would like to thank David Balli for providing the RNA-seq analysis design, Katie Siewert for providing the phylogenetic analysis design, and Alex Whan for contributing a Travis-CI implementation.

References

1. Rebooting review. Nat Biotech. 2015;33(4):319.

   http://dx.doi.org/10.1038/nbt.3202.
2. Software with impact. Nat Meth. 2014;11(3):211. http://dx.doi.org/10.1038/nmeth.2880.
3. Peng RD. Reproducible Research in Computational Science. Science. 2011;334(6060):1226-1227. doi:10.1126/science.1213847.
4. McNutt M. Reproducibility. Science. 2014;343(6168):229. http://science.sciencemag.org/content/343/6168/229.abstract.
5. Illuminating the black box. Nature. 2006;442(7098):1. http://dx.doi.org/10.1038/442001a.
6. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452-454. doi:10.1038/533452a.
7. Garijo D, Kinnings S, Xie LL, et al. Quantifying reproducibility in computational biology: The case of the tuberculosis drugome. PLoS One. 2013;8(11). doi:10.1371/journal.pone.0080278.
8. Kinnings SL, Xie LL, Fung KH, Jackson RM, Xie LL, Bourne PE. The Mycobacterium tuberculosis drugome and its polypharmacological implications. PLoS Comput Biol. 2010;6(11). doi:10.1371/journal.pcbi.1000976.
9. Bell AW, Deutsch EW, Au CE, et al. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods. 2009;6(6):423-430. doi:10.1038/nmeth.1333.
10. Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet. 2009;41(2):149-155. doi:10.1038/ng.295.
11. Hothorn T, Leisch F. Case studies in reproducibility. Brief Bioinform. 2011;12(3):288-300. doi:10.1093/bib/bbq084.
12. Groves T, Godlee F. Open science and reproducible research. BMJ. 2012;344. doi:10.1136/bmj.e4383.
13. ThinkLab. https://thinklab.com/. Accessed January 1, 2016.
14. Boettiger C. An introduction to Docker for reproducible research, with examples from the R environment. ACM SIGOPS Oper Syst Rev Spec Issue Repeatability Shar Exp Artifacts. 2015;49(1):71-79. doi:10.1145/2723872.2723882.

15. Dai M, Wang P, Boyd AD, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33(20):e175. doi:10.1093/nar/gni179.
16. Kopljar I, Gallacher DJ, De Bondt A, et al. Functional and Transcriptional Characterization of Histone Deacetylase Inhibitor-Mediated Cardiac Adverse Effects in Human Induced Pluripotent Stem Cell-Derived Cardiomyocytes. Stem Cells Transl Med. 2016;5(5):602-612. doi:10.5966/sctm.2015-0279.
17. Karpiński P, Frydecka D, Sąsiadek M. Reduced number of peripheral natural killer cells in schizophrenia but not in bipolar disorder. Brain, Behav. 2016. http://www.sciencedirect.com/science/article/pii/S0889159116300265. Accessed May 31, 2016.
18. Brummelman J, Raeven R, Helm K. Transcriptome signature for dampened Th2 dominance in acellular pertussis vaccine-induced CD4+ T cell responses through TLR4 ligation. Scientific. 2016. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4846868/. Accessed May 31, 2016.
19. Bilgrau A, Eriksen P, Rasmussen J. GMCM: Unsupervised clustering and meta-analysis using gaussian mixture copula models. J Stat. 2016. https://www.jstatsoft.org/article/view/v070i02/v70i02.pdf. Accessed May 31, 2016.
20. Gandin V, Masvidal L, Cargnello M, Gyenis L. mTORC1 and CK2 coordinate ternary and eIF4F complex assembly. Nature. 2016. http://www.nature.com/ncomms/2016/160404/ncomms11127/full/ncomms11127.html. Accessed May 31, 2016.
21. Killeen A, Diskin M, Morris D. Endometrial gene expression in high- and low-fertility heifers in the late luteal phase of the estrous cycle and a comparison with midluteal gene expression. Physiological. 2016. http://physiolgenomics.physiology.org/content/48/4/306.abstract. Accessed May 31, 2016.
22. Colletti N, Liu H, Gower A, Alekseyev Y. Tlr3 signaling Promotes the induction of Unique human BDca-3 Dendritic cell Populations. Front. 2016. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4789364/. Accessed May 31, 2016.
23. Lee M, Huang R, Tong W. Discovery of transcriptional targets regulated by nuclear receptors using a probabilistic graphical model. Toxicol Sci. 2015. http://toxsci.oxfordjournals.org/content/early/2015/12/07/toxsci.kfv261.abstract. Accessed May 31, 2016.
24. Troy N, Hollams E, Holt P. Differential gene network analysis for the identification of asthma-associated therapeutic targets in allergen-specific T-helper memory responses. BMC Med. 2016. http://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-016-0171-z. Accessed May 31, 2016.
25. Manié E, Popova T, Battistella A. Genomic hallmarks of homologous recombination deficiency in invasive breast carcinomas. J Cancer. 2016. http://onlinelibrary.wiley.com/doi/10.1002/ijc.29829/full. Accessed May 31, 2016.

26. DekkersB,HeH,HansonJ,WillemsL.TheArabidopsisDELAYOFGERMINATION1geneaffectsABSCISICACIDINSENSITIVE5(ABI5)expressionandgeneticallyinteractswithABI3duringArabidopsis.ThePlant.2016.http://onlinelibrary.wiley.com/doi/10.1111/tpj.13118/full.AccessedMay31,2016.

27. HoltP,StricklandD,BoscoA,BelgraveD.DistinguishingbenignfrompathologicTH2immunityinatopicchildren.JAllergy.2015.http://www.sciencedirect.com/science/article/pii/S0091674915013342.AccessedMay31,2016.

28. LückS,WestermarkP.CircadianmRNAexpression:insightsfrommodelingandtranscriptomics.CellMolLifeSci.2016.http://link.springer.com/article/10.1007/s00018-015-2072-2.AccessedMay31,2016.

29. Bosco A, Wiehler S, Proud D. Interferon regulatory factor 7 regulates airway epithelial cell responses to human rhinovirus infection. BMC Genomics. 2016. http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2405-z. Accessed May 31, 2016.

30. Fauteux F, Hill J, Jaramillo M, Pan Y, Phan S. Computational selection of antibody-drug conjugate targets for breast cancer. Oncotarget. 2015. http://europepmc.org/abstract/med/26700623. Accessed May 31, 2016.

31. Napolitano F, Sirci F, Carrella D, Bernardo D di. Drug-set enrichment analysis: a novel tool to investigate drug mode of action. Bioinformatics. 2016. http://bioinformatics.oxfordjournals.org/content/32/2/235.short. Accessed May 31, 2016.

32. Carroll J, Meyer C, Song J, Li W, Geistlinger T. Genome-wide analysis of estrogen receptor binding sites. Nature. 2006. http://www.nature.com/ng/journal/v38/n11/abs/ng1901.html. Accessed May 31, 2016.

33. Lupien M, Eeckhoute J, Meyer C, Wang Q, Zhang Y. FoxA1 translates epigenetic signatures into enhancer-driven lineage-specific transcription. Cell. 2008. http://www.sciencedirect.com/science/article/pii/S0092867408001189. Accessed May 31, 2016.

34. Wang Q, Li W, Zhang Y, et al. Androgen receptor regulates a distinct transcription program in androgen-independent prostate cancer. Cell. 2009. http://www.sciencedirect.com/science/article/pii/S0092867409005170. Accessed May 31, 2016.

35. Lefterova M, Zhang Y, Steger D. PPARγ and C/EBP factors orchestrate adipocyte biology via adjacent binding on a genome-wide scale. Genes &. 2008. http://genesdev.cshlp.org/content/22/21/2941.short. Accessed May 31, 2016.

36. Tuupanen S, Turunen M, Lehtonen R, Hallikas O. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nature. 2009. http://www.nature.com/ng/journal/v41/n8/abs/ng.406.html. Accessed May 31, 2016.

37. Obad S, Santos C dos, Petri A, Heidenblad M. Silencing of microRNA families by seed-targeting tiny LNAs. Nature. 2011. http://www.nature.com/ng/journal/v43/n4/abs/ng.786.html. Accessed May 31, 2016.

38. He H, Meyer C, Shin H, Bailey S, Wei G, Wang Q. Nucleosome dynamics define transcriptional enhancers. Nature. 2010. http://www.nature.com/ng/journal/v42/n4/abs/ng.545.html. Accessed May 31, 2016.

39. Ozsolak F, Song J, Liu X, Fisher D. High-throughput mapping of the chromatin structure of human promoters. Nat Biotechnol. 2007. http://www.nature.com/nbt/journal/v25/n2/abs/nbt1279.html. Accessed May 31, 2016.

40. Zuo T, Wang L, Morrison C, Chang X, Zhang H, Li W. FOXP3 is an X-linked breast cancer suppressor gene and an important repressor of the HER-2/ErbB2 oncogene. Cell. 2007. http://www.sciencedirect.com/science/article/pii/S0092867407005454. Accessed May 31, 2016.

41. Enard W, Gehre S, Hammerschmidt K, Hölter S. A humanized version of Foxp2 affects cortico-basal ganglia circuits in mice. Cell. 2009. http://www.sciencedirect.com/science/article/pii/S009286740900378X. Accessed May 31, 2016.

42. Nunez M, Sanchez-Jimenez C, Alcalde J, Izquierdo JM. Long-term reduction of T-cell intracellular antigens reveals a transcriptome associated with extracellular matrix and cell adhesion components. PLoS One. 2014;9(11). doi:10.1371/journal.pone.0113141.

43. Beaulieu-Jones B, Greene C. Continuous Analysis BrainArray: Submission Release. August 2016. doi:10.5281/zenodo.59892.

44. Duvall P, Matyas S, Glover A. Continuous Integration: Improving Software Quality and Reducing Risk. 2007. http://portal.acm.org/citation.cfm?id=1406212.

45. Docker. Docker. https://www.docker.com.

46. Beaulieu-Jones BK, Greene CS. Continuous Analysis Example Docker Images. 2016. doi:10.6084/m9.figshare.3545156.v1.

47. Beaulieu-Jones BK, Greene CS. Continuous Analysis. GitHub repository. https://github.com/greenelab/continuous_analysis. Published 2016.

48. Drone.io. https://drone.io/.

49. Beaulieu-Jones BK. Denoising Autoencoders for Phenotype Stratification (DAPS): Preprint Release. Zenodo. January 2016. doi:10.5281/zenodo.46165.

50. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059-3066. doi:10.1093/nar/gkf436.

51. Rice P, Longden I, Bleasby A, et al. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276-277. doi:10.1016/s0168-9525(00)02024-2.

52. Boj SF, Hwang C-I, Baker LA, et al. Organoid Models of Human and Mouse Ductal Pancreatic Cancer. Cell. 2015;160(1):324-338. doi:10.1016/j.cell.2014.12.021.

53. Balli D. Using Kallisto for expression analysis of published RNAseq data. https://benchtobioinformatics.wordpress.com/2015/07/10/using-kallisto-for-gene-expression-analysis-of-published-rnaseq-data/. Published 2015. Accessed August 1, 2016.

54. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-527. doi:10.1038/nbt.3519.

55. Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi:10.1093/nar/gkv007.

56. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3. doi:10.2202/1544-6115.1027.

57. Pimentel HJ, Bray N, Puente S, Melsted P, Pachter L. Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv. 2016. doi:10.1101/058164.

58. Pérez F, Granger BE. IPython: a System for Interactive Scientific Computing. Comput Sci Eng. 2007;9(3):21-29. doi:10.1109/MCSE.2007.53.

59. Jupyter. http://jupyter.org/. Published 2016. Accessed January 8, 2016.

60. RStudio. RStudio: Integrated development environment for R (Version 0.97.311). J Wildl Manage. 2011;75(8):1753-1766. doi:10.1002/jwmg.232.

61. Baumer B, Cetinkaya-Rundel M, Bray A, Loi L, Horton NJ. R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics. Technol Innov Stat Educ. 2014;8(1):20. doi:10.5811/westjem.2011.5.6700.

62. Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. Compstat 2002 - Proc Comput Stat. 2002;(69):575-580. doi:10.1.1.20.2737.

63. Beaulieu-Jones BK, Greene CS. Semi-Supervised Learning of the Electronic Health Record with Denoising Autoencoders for Phenotype Stratification. bioRxiv. February 2016. http://biorxiv.org/content/early/2016/02/18/039800.abstract.

64. Souilmi Y, Lancaster AK, Jung J-Y, et al. Scalable and cost-effective NGS genotyping in the cloud. BMC Med Genomics. 2015;8(1):64. doi:10.1186/s12920-015-0134-9.
