Page 1
Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform
Victoria Stodden, Associate Professor, School of Information Sciences
University of Illinois at Urbana-Champaign
ParCo Symposium: Reproducibility in Data-Intensive Computing
Prague, CZ, September 10, 2019
Page 2
Agenda
1. Infrastructure Contributions: The Whole Tale Project
2. Extending Whole Tale to Enable “Tales at Scale”
3. Infrastructure Challenges
Page 3
Parsing Reproducibility
● Empirical Reproducibility:
  ○ traditional empirical experiments, e.g. at the bench/lab
● Statistical Reproducibility:
  ○ statistical methodology used permits generalizability of data inferences
● Computational Reproducibility:
  ○ transparency of computational steps that produce scientific findings
V. Stodden. (2013). Resolving Irreproducibility in Empirical and Computational Research. IMS Bulletin.
Page 4
Whole Tale: Merging Science & Cyberinfrastructure Pathways
Page 5
Whole Tale Collaboration (PI Team)
● U Illinois (NCSA): Bertram Ludäscher, Victoria Stodden, Matt Turk
  ○ overall lead (co-operative agreement)
  ○ reproducibility; provenance; open source software development; outreach
● U Chicago (Globus): Kyle Chard
  ○ data transfer & storage; compute; infrastructure
● UC Santa Barbara (NCEAS): Matt Jones
  ○ (meta-)data publishing; provenance; repositories
● U Texas, Austin (TACC): Niall Gaffney
  ○ compute; HTC; “big tale”; Science Gateways
● U Notre Dame (CRC): Jarek Nabrzyski
  ○ UX design; UI design
Page 6
Simplifying Computational Reproducibility in Whole Tale
● Researchers can easily package and share tales:
  ○ Data, Code, and Compute Environment
    ■ including narrative and workflow information, including inputs, outputs, and intermediates
  ○ to re-create the computational results from a scientific study
  ○ achieving computational reproducibility
  ○ thus “setting the default to reproducible.”
● Also empowers users to verify and extend results with different data, methods, and environments.
V. Stodden, D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics, ICERM Workshop 2013.
Page 7
Whole Tale: What’s in a name…
A Double Entendre:
  ○ Whole tale: captures the end-to-end scientific discovery story, including computational aspects
  ○ Long tail: includes all computational research, e.g. bespoke or small-scale research
Addresses problems scientists face:
  ○ Reproducibility (and reuse) challenges in computational & data-enabled research (e.g. data + code access, dependency hell, …)
Whole Tale Approach:
  ○ directly respond to community needs and requirements
Page 8
The Need for a Platform for Reproducible Research
● Enable researchers to (easily) manage the complete conduct of a computational experiment and permit its exposure as a publishable “Tale”
● Address the two trends simultaneously:
  ○ improved transparency, so researchers can run much more ambitious computational experiments
  ○ and better computational experiment infrastructure, which will allow researchers to be more transparent
D. Donoho and V. Stodden. (2015). Reproducible Research in the Mathematical Sciences. The Princeton Companion to Applied Mathematics, Ed. N. J. Higham.
Page 9
So what is Whole Tale?
● A web-based, open source platform for reproducible research, for the creation, publication, and execution of tales: executable research objects that capture data, code, and details of the computing environment used to produce research findings
● Driven by Community Engagement:
  ○ Working groups, internships, collaborations, etc.
● Enhances Education & Training:
  ○ Training for reproducibility; use of Whole Tale in the classroom
Page 10
WT Software Development
● Open-Source Development Model
  ○ across 5 collaborative sites
● All source is open:
  ○ https://github.com/whole-tale/
● Developers are expected to follow the:
  ○ Developer's guide
● Open communication via:
  ○ weekly calls with public meeting notes
● Software releases follow a:
  ○ Development plan
Development
Workshops & Working Groups
Page 11
What exactly is (in) a Tale?
● Tale = executable research object, i.e.
  ○ data (references)
  ○ + code (computational methods)
  ○ + narrative (traditional science story)
  ○ + compute environment (e.g. RStudio, Jupyter)
● Captured in a standards-based tale format complete with metadata
Code/Narrative
Compute environment
Data
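As a rough illustration of the four components listed on this slide, a tale can be pictured as one record that bundles data references, code, narrative, and a compute environment with metadata. The sketch below is only an assumption-laden stand-in: the class name `Tale`, its field names, and the placeholder identifier are hypothetical and do not reflect the actual standards-based tale format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Tale:
    """Illustrative stand-in for a tale; NOT the real Whole Tale schema."""

    title: str
    data_refs: List[str] = field(default_factory=list)   # references to data, not copies
    code_paths: List[str] = field(default_factory=list)  # computational methods
    narrative: str = ""                                   # the science story
    environment: str = "Jupyter"                          # e.g. RStudio, Jupyter
    metadata: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize all components together so they travel as one object."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


tale = Tale(
    title="Example analysis",
    data_refs=["doi:10.0000/placeholder"],  # placeholder identifier, not a real DOI
    code_paths=["analysis.ipynb"],
    narrative="README.md",
    environment="Jupyter",
    metadata={"author": "A. Researcher"},
)
print(tale.to_json())
```

Keeping data as references rather than embedded copies mirrors the "data (references)" bullet: the record stays small even when the referenced datasets are large.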
Page 12
Browse Existing Tales…
Page 13
…Compose New Tales…
Page 14
…Run & Interact with Tales…
Page 15
…Use Tale Metadata…
Page 16
…Integrate Data Repos with Whole Tale!
● Enables turnkey exploratory data analysis on existing published datasets
● DataONE and Dataverse networks cover >90 major research repositories!
Page 17
Accelerating Reproducible Open Science
Diagram labels: Research Question; Input Data; Analysis; Output; Data; Narrative; Published Article; Verify / Reproduce / Re-use; Accelerate
Page 18
Whole Tale Platform Overview
Diagram labels: Research & Quantitative Computational Environments; External Data Sources; Code + Narrative
Workflow stages: Create tale; Analyze data; Publish Tale
● Authenticate using your institutional identity
● Access commonly-used computational environments
● Easily customize your environment
● Reference and access externally registered data
● Create or upload your data and code
● Add metadata (including provenance information)
● Submit code, data, and environment to archival repository
● Get a persistent identifier
● Share for verification and re-use
Page 19
Whose problems are we addressing?
● Researchers, scientists, and others, who may be
  ○ creators of tales, e.g. share your findings in a tale
  ○ reviewers of articles, who can review tales, e.g. reproduce new scientific claims
  ○ (re-)users of tales, e.g. build upon progress of others
Page 20
Extending WT to Data-Intensive Research
● Motivating scenario: The Renaissance Simulations Laboratory (RSL) provides access to over 70 TB of raw data and derived data products. RSL exposes data available on systems at the San Diego Supercomputer Center via Jupyter web-based interactive environments.
● Relevant features:
  1. the RS data is large, impractical to transfer, and requires large-scale resources to analyze.
  2. the research community leverages Jupyter interactive environments for both exploratory and primary analytical work, with some analysis requiring batch compute resources.
  3. the community is interested in sharing resulting research artifacts (e.g., code, derived data) for both re-execution and re-use.
Page 21
Extending WT to Data-Intensive Research
Tale frontend and HPC workloads on WT deployment cluster: users can launch local HPC jobs using standard system calls.
Tale frontend on a single HPC compute node: the Tale frontend (Jupyter/RStudio notebooks) runs on compute nodes in an HPC cluster and launches independent HPC jobs using standard system calls.
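Launching a job "using standard system calls" from a notebook can be sketched as spawning a child process and collecting its exit status and output. This is a minimal sketch, not the Whole Tale implementation: the helper name `run_local_job` is hypothetical, and a one-line Python command stands in for a real HPC workload so that the example runs anywhere.

```python
import subprocess
import sys


def run_local_job(argv, timeout_s=60):
    """Run a command as a child process and return (exit code, stdout)."""
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return completed.returncode, completed.stdout


# Stand-in workload; a real tale would name its simulation or analysis binary here.
job = [sys.executable, "-c", "print(sum(range(10)))"]
code, out = run_local_job(job)
print(code, out.strip())
```

Because the job runs on the same node as the frontend, a long or heavy workload competes with the notebook for resources, which is one reason the later configurations hand jobs to a queuing system instead.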
Page 22
Extending WT to Data-Intensive Research
Tale frontend on HPC compute node with local LRM (cluster queuing system) access: Allows submission of HPC jobs to the queuing system of the cluster.
Tale frontend on HPC compute nodes with MPI: launch the Tale frontend as an MPI job. The cluster LRM (queuing system) allocates the number of nodes requested at the submission of the Tale frontend job and sets the appropriate MPI environment. The Tale frontend would run on the lead node allocated to the MPI job by the LRM and would launch MPI subjobs on the nodes allocated to the MPI job.
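Submitting work to the cluster queuing system usually means handing the LRM a batch script that states the resources requested. The sketch below builds such a script as text. It assumes Slurm as an example LRM, which the slides do not name; the function name `build_batch_script`, the job name, and the `./simulate input.cfg` command are hypothetical.

```python
def build_batch_script(job_name, nodes, tasks_per_node, walltime, command):
    """Return the text of a Slurm batch script for an MPI job.

    Slurm is an assumed example LRM; the slides do not name one.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        # srun starts one process per allocated task inside the allocation.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


script = build_batch_script(
    "tale-subjob",
    nodes=4,
    tasks_per_node=16,
    walltime="00:30:00",
    command="./simulate input.cfg",
)
print(script)
# On a cluster with Slurm installed, the script could then be submitted with:
#   subprocess.run(["sbatch"], input=script, text=True, check=True)
```

Generating the script as plain text keeps the example runnable without a scheduler; only the commented-out final step needs a real cluster.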
Page 23
Extending WT to Data-Intensive Research
Tale frontend on WT cluster with remote LRM access: Tale frontends run alongside WT services, but HPC jobs can be submitted to remote clusters via the middleware.
Decoupled Tale frontend with LRM Remote Access: Tale frontends run on various resources and HPC jobs can run on any resources supported by the middleware. Users could bypass the limitations present in the default resources provided by the WT infrastructure e.g. a user with cloud access could request that a Tale be run on cloud resources under the user’s account.
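The decoupling described here can be pictured as a thin dispatch layer: the frontend names a resource, and the middleware routes the job to whichever backend serves it. The sketch below is purely illustrative. The `Middleware` class, the resource names, and `fake_backend` are all hypothetical, and the backends only fabricate job identifiers rather than contacting any real cluster or cloud.

```python
from typing import Callable, Dict, List

Backend = Callable[[List[str]], str]  # takes a command, returns a job identifier


class Middleware:
    """Hypothetical stand-in for the job-submission middleware."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, resource: str, backend: Backend) -> None:
        """Make a resource available under a name the frontend can request."""
        self._backends[resource] = backend

    def submit(self, resource: str, argv: List[str]) -> str:
        """Route a job to the backend registered for the named resource."""
        if resource not in self._backends:
            raise KeyError(f"no backend registered for resource {resource!r}")
        return self._backends[resource](argv)


def fake_backend(prefix: str) -> Backend:
    """Build a backend that only fabricates an identifier; it runs nothing."""
    counter = {"n": 0}

    def backend(argv: List[str]) -> str:
        counter["n"] += 1
        return f"{prefix}-{counter['n']}"

    return backend


mw = Middleware()
mw.register("campus-cluster", fake_backend("hpc"))
mw.register("user-cloud", fake_backend("cloud"))
print(mw.submit("user-cloud", ["./simulate", "input.cfg"]))
```

The frontend code calls `submit` the same way regardless of where the job lands, which is what would let a user with cloud access point a Tale at resources under their own account.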
Page 24
Challenges to Extending WT
● The need to maintain responsiveness of Tale frontends
● Dependence on middleware: scalability and longevity
● Managing HPC network restrictions
  ○ Tale frontends require incoming network connections in order to expose their user interface. Consequently, a general solution involving Tale frontends on compute nodes requires some form of proxying of connections from the Whole Tale cluster to HPC cluster compute nodes. Restrictions on incoming network connections are likely to be a result of local security policies, and therefore proxying, even if authenticated, may be seen as an unwelcome circumvention of such policies.
● Containerization and HPC workloads, e.g. a dependence on specific hardware, which can affect the ability of the code to be re-run if that hardware becomes unavailable
● Data access and quasi-locality: if Tale frontends and/or HPC workloads run on HPC resources on which copies of the data are already available, the WT implementation would be inefficient, since each file would be transferred once to WT resources and once for each Tale frontend instance that accesses the file
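The transfer cost in the last bullet can be made concrete with simple arithmetic: each file moves once to WT resources and once more per Tale frontend instance, so a file read by n instances is copied 1 + n times. The sketch below assumes, for simplicity, that every instance accesses every file; the function name and the file sizes are hypothetical.

```python
def transfer_volume_gb(file_sizes_gb, frontend_instances, reuse_local_copies):
    """Estimate data moved when every frontend instance reads every file.

    reuse_local_copies=True models the quasi-local case, where the copies
    already present on the HPC resource are read in place.
    """
    if reuse_local_copies:
        return 0
    copies_per_file = 1 + frontend_instances  # once to WT, once per instance
    return sum(file_sizes_gb) * copies_per_file


sizes = [10, 20, 40]  # hypothetical file sizes in GB
via_wt = transfer_volume_gb(sizes, frontend_instances=5, reuse_local_copies=False)
in_place = transfer_volume_gb(sizes, frontend_instances=5, reuse_local_copies=True)
print(via_wt, in_place)
```

With 70 GB of data and five frontend instances, the default path moves 420 GB, while reading the existing local copies moves none; the gap grows linearly with the number of instances.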
Page 25
Conclusion
Whole Tale offers potential for enabling reproducibility for data-intensive computing, but is not without challenges requiring innovation in the software architecture and infrastructure implementation.
However, reproducibility is now recognized as a pressing issue, of which computational infrastructure is one key part.
Infrastructure supporting transparency and reproducibility will be used not out of hygiene or as a best practice, but because it enables increasingly ambitious computational research.